Authors: "Anish Bhandari, Will Jones, Nicholas Sager"
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
# Required Libraries
library(tidyverse)
library(knitr)
library(kableExtra)
library(ggthemes)
library(caret)
library(janitor)
library(doParallel)
#library(e1071)
#library(class)
Introduction
In this project, we work through data processing, exploratory data analysis, and model construction for the Ames housing data. Our primary objective is to build robust models that offer useful insights and predict accurately. The first model prioritizes interpretability, so that we can explain how the predictors relate to sale price. We then develop two additional models that emphasize predictive accuracy.
The video presentation for this project can be found at: https://youtu.be/_Rdo4PIEZZI
Data Description
Kaggle is used by data scientists and machine learning engineers to discover data, build models, and compete in challenges. One of the most popular competitions on Kaggle is 'House Prices - Advanced Regression Techniques'. As of 6/6/2023, this competition had close to 28K entries.
The Ames Housing dataset was compiled by Dean De Cock and can be found at the link below. There are two files: train.csv and test.csv. Both datasets have 79 explanatory variables. SalePrice is the response variable; it is present in train and absent in test. The train dataset has 1460 unique rows and the test dataset has 1459 unique rows. We fit our models on the train dataset only; the test dataset is included so that predictive performance can be assessed using Kaggle prediction scores.
Dataset Link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
Read the Data
# train <- read.csv("https://raw.githubusercontent.com/NickSager/DS_6372_Ames2/master/Data/train.csv")
# test<- read.csv("https://raw.githubusercontent.com/NickSager/DS_6372_Ames2/master/Data/test.csv")
train <- read.csv("Data/train.csv")
test<- read.csv("Data/test.csv")
# Merge the data frames and add a column indicating whether they come from the train or test set
train$train <- 1
test$SalePrice <- NA
test$train <- 0
ames <- rbind(train, test)
# Verify data frame
head(ames)
str(ames)
summary(ames)
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ⋯ | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <int> | <int> | <chr> | <int> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | ⋯ | <chr> | <chr> | <chr> | <int> | <int> | <int> | <chr> | <chr> | <int> | <dbl> |
| 1 | 1 | 60 | RL | 65 | 8450 | Pave | NA | Reg | Lvl | AllPub | ⋯ | NA | NA | NA | 0 | 2 | 2008 | WD | Normal | 208500 | 1 |
| 2 | 2 | 20 | RL | 80 | 9600 | Pave | NA | Reg | Lvl | AllPub | ⋯ | NA | NA | NA | 0 | 5 | 2007 | WD | Normal | 181500 | 1 |
| 3 | 3 | 60 | RL | 68 | 11250 | Pave | NA | IR1 | Lvl | AllPub | ⋯ | NA | NA | NA | 0 | 9 | 2008 | WD | Normal | 223500 | 1 |
| 4 | 4 | 70 | RL | 60 | 9550 | Pave | NA | IR1 | Lvl | AllPub | ⋯ | NA | NA | NA | 0 | 2 | 2006 | WD | Abnorml | 140000 | 1 |
| 5 | 5 | 60 | RL | 84 | 14260 | Pave | NA | IR1 | Lvl | AllPub | ⋯ | NA | NA | NA | 0 | 12 | 2008 | WD | Normal | 250000 | 1 |
| 6 | 6 | 50 | RL | 85 | 14115 | Pave | NA | IR1 | Lvl | AllPub | ⋯ | NA | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 | 1 |
'data.frame': 2919 obs. of 82 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
$ MSZoning : chr "RL" "RL" "RL" "RL" ...
$ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
$ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
$ Street : chr "Pave" "Pave" "Pave" "Pave" ...
$ Alley : chr NA NA NA NA ...
$ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
$ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
$ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
$ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
$ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
$ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
$ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
$ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
$ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
$ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
$ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
$ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
$ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
$ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
$ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
$ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
$ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
$ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
$ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
$ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
$ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
$ ExterCond : chr "TA" "TA" "TA" "TA" ...
$ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
$ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
$ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
$ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
$ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
$ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
$ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
$ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
$ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
$ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
$ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
$ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
$ CentralAir : chr "Y" "Y" "Y" "Y" ...
$ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
$ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
$ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
$ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
$ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
$ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
$ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
$ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
$ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
$ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
$ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
$ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
$ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
$ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
$ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
$ FireplaceQu : chr NA "TA" "TA" "Gd" ...
$ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
$ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
$ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
$ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
$ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
$ GarageQual : chr "TA" "TA" "TA" "TA" ...
$ GarageCond : chr "TA" "TA" "TA" "TA" ...
$ PavedDrive : chr "Y" "Y" "Y" "Y" ...
$ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
$ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
$ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
$ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
$ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolQC : chr NA NA NA NA ...
$ Fence : chr NA NA NA NA ...
$ MiscFeature : chr NA NA NA NA ...
$ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
$ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
$ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
$ SaleType : chr "WD" "WD" "WD" "WD" ...
$ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
$ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
$ train : num 1 1 1 1 1 1 1 1 1 1 ...
Id MSSubClass MSZoning LotFrontage
Min. : 1.0 Min. : 20.00 Length:2919 Min. : 21.00
1st Qu.: 730.5 1st Qu.: 20.00 Class :character 1st Qu.: 59.00
Median :1460.0 Median : 50.00 Mode :character Median : 68.00
Mean :1460.0 Mean : 57.14 Mean : 69.31
3rd Qu.:2189.5 3rd Qu.: 70.00 3rd Qu.: 80.00
Max. :2919.0 Max. :190.00 Max. :313.00
NA's :486
LotArea Street Alley LotShape
Min. : 1300 Length:2919 Length:2919 Length:2919
1st Qu.: 7478 Class :character Class :character Class :character
Median : 9453 Mode :character Mode :character Mode :character
Mean : 10168
3rd Qu.: 11570
Max. :215245
LandContour Utilities LotConfig LandSlope
Length:2919 Length:2919 Length:2919 Length:2919
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Neighborhood Condition1 Condition2 BldgType
Length:2919 Length:2919 Length:2919 Length:2919
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
HouseStyle OverallQual OverallCond YearBuilt
Length:2919 Min. : 1.000 Min. :1.000 Min. :1872
Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
Mode :character Median : 6.000 Median :5.000 Median :1973
Mean : 6.089 Mean :5.565 Mean :1971
3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2001
Max. :10.000 Max. :9.000 Max. :2010
YearRemodAdd RoofStyle RoofMatl Exterior1st
Min. :1950 Length:2919 Length:2919 Length:2919
1st Qu.:1965 Class :character Class :character Class :character
Median :1993 Mode :character Mode :character Mode :character
Mean :1984
3rd Qu.:2004
Max. :2010
Exterior2nd MasVnrType MasVnrArea ExterQual
Length:2919 Length:2919 Min. : 0.0 Length:2919
Class :character Class :character 1st Qu.: 0.0 Class :character
Mode :character Mode :character Median : 0.0 Mode :character
Mean : 102.2
3rd Qu.: 164.0
Max. :1600.0
NA's :23
ExterCond Foundation BsmtQual BsmtCond
Length:2919 Length:2919 Length:2919 Length:2919
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
Length:2919 Length:2919 Min. : 0.0 Length:2919
Class :character Class :character 1st Qu.: 0.0 Class :character
Mode :character Mode :character Median : 368.5 Mode :character
Mean : 441.4
3rd Qu.: 733.0
Max. :5644.0
NA's :1
BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:2919
1st Qu.: 0.00 1st Qu.: 220.0 1st Qu.: 793.0 Class :character
Median : 0.00 Median : 467.0 Median : 989.5 Mode :character
Mean : 49.58 Mean : 560.8 Mean :1051.8
3rd Qu.: 0.00 3rd Qu.: 805.5 3rd Qu.:1302.0
Max. :1526.00 Max. :2336.0 Max. :6110.0
NA's :1 NA's :1 NA's :1
HeatingQC CentralAir Electrical X1stFlrSF
Length:2919 Length:2919 Length:2919 Min. : 334
Class :character Class :character Class :character 1st Qu.: 876
Mode :character Mode :character Mode :character Median :1082
Mean :1160
3rd Qu.:1388
Max. :5095
X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
Min. : 0.0 Min. : 0.000 Min. : 334 Min. :0.0000
1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.:1126 1st Qu.:0.0000
Median : 0.0 Median : 0.000 Median :1444 Median :0.0000
Mean : 336.5 Mean : 4.694 Mean :1501 Mean :0.4299
3rd Qu.: 704.0 3rd Qu.: 0.000 3rd Qu.:1744 3rd Qu.:1.0000
Max. :2065.0 Max. :1064.000 Max. :5642 Max. :3.0000
NA's :2
BsmtHalfBath FullBath HalfBath BedroomAbvGr
Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.00
1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.00
Median :0.00000 Median :2.000 Median :0.0000 Median :3.00
Mean :0.06136 Mean :1.568 Mean :0.3803 Mean :2.86
3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.00
Max. :2.00000 Max. :4.000 Max. :2.0000 Max. :8.00
NA's :2
KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
Min. :0.000 Length:2919 Min. : 2.000 Length:2919
1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
Median :1.000 Mode :character Median : 6.000 Mode :character
Mean :1.045 Mean : 6.452
3rd Qu.:1.000 3rd Qu.: 7.000
Max. :3.000 Max. :15.000
Fireplaces FireplaceQu GarageType GarageYrBlt
Min. :0.0000 Length:2919 Length:2919 Min. :1895
1st Qu.:0.0000 Class :character Class :character 1st Qu.:1960
Median :1.0000 Mode :character Mode :character Median :1979
Mean :0.5971 Mean :1978
3rd Qu.:1.0000 3rd Qu.:2002
Max. :4.0000 Max. :2207
NA's :159
GarageFinish GarageCars GarageArea GarageQual
Length:2919 Min. :0.000 Min. : 0.0 Length:2919
Class :character 1st Qu.:1.000 1st Qu.: 320.0 Class :character
Mode :character Median :2.000 Median : 480.0 Mode :character
Mean :1.767 Mean : 472.9
3rd Qu.:2.000 3rd Qu.: 576.0
Max. :5.000 Max. :1488.0
NA's :1 NA's :1
GarageCond PavedDrive WoodDeckSF OpenPorchSF
Length:2919 Length:2919 Min. : 0.00 Min. : 0.00
Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
Mode :character Mode :character Median : 0.00 Median : 26.00
Mean : 93.71 Mean : 47.49
3rd Qu.: 168.00 3rd Qu.: 70.00
Max. :1424.00 Max. :742.00
EnclosedPorch X3SsnPorch ScreenPorch PoolArea
Min. : 0.0 Min. : 0.000 Min. : 0.00 Min. : 0.000
1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
Median : 0.0 Median : 0.000 Median : 0.00 Median : 0.000
Mean : 23.1 Mean : 2.602 Mean : 16.06 Mean : 2.252
3rd Qu.: 0.0 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.000
Max. :1012.0 Max. :508.000 Max. :576.00 Max. :800.000
PoolQC Fence MiscFeature MiscVal
Length:2919 Length:2919 Length:2919 Min. : 0.00
Class :character Class :character Class :character 1st Qu.: 0.00
Mode :character Mode :character Mode :character Median : 0.00
Mean : 50.83
3rd Qu.: 0.00
Max. :17000.00
MoSold YrSold SaleType SaleCondition
Min. : 1.000 Min. :2006 Length:2919 Length:2919
1st Qu.: 4.000 1st Qu.:2007 Class :character Class :character
Median : 6.000 Median :2008 Mode :character Mode :character
Mean : 6.213 Mean :2008
3rd Qu.: 8.000 3rd Qu.:2009
Max. :12.000 Max. :2010
SalePrice train
Min. : 34900 Min. :0.0000
1st Qu.:129975 1st Qu.:0.0000
Median :163000 Median :1.0000
Mean :180921 Mean :0.5002
3rd Qu.:214000 3rd Qu.:1.0000
Max. :755000 Max. :1.0000
NA's :1459
For data cleaning purposes, we merged test and train into one dataset; the 1459 NA's in the SalePrice column all come from the test set. The added 'train' column indicates whether each row comes from the train or test set.
Data Cleaning
In order to use a linear regression model, we need to convert all of the categorical variables into dummy variables. We will also remove or impute the NA's in the data set.
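To make the dummy-variable step concrete, here is a minimal sketch on a made-up three-row data frame. It uses base R's `model.matrix()` as a stand-in for the `caret::dummyVars()` call used further down in this notebook; the toy values are assumptions for illustration only, and the two functions do not name or arrange their output columns identically.

```r
# Toy data (hypothetical values, chosen only for illustration)
toy <- data.frame(
  Street  = c("Pave", "Grvl", "Pave"),
  LotArea = c(8450, 9600, 11250),
  stringsAsFactors = TRUE
)

# "- 1" drops the intercept, so the first factor in the formula
# gets one 0/1 indicator column per level
toy_dummy <- model.matrix(~ Street + LotArea - 1, data = toy)

print(colnames(toy_dummy))
print(toy_dummy)
```

Each level of Street becomes its own numeric column, which is the form a linear regression needs; numeric columns such as LotArea pass through unchanged.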
# Summarize NA's by column
ames %>%
summarise_all(~(sum(is.na(.)))) %>%
gather(key = "Column", value = "NA_Count", -1) %>%
filter(NA_Count > 0) %>%
ggplot(aes(x = reorder(Column, NA_Count), y = NA_Count)) +
geom_col() +
coord_flip() +
theme_gdocs() +
labs(title = "Number of NA's by Column", x = "Column", y = "NA Count")
# Create a table of the missing NAs by column
ames %>%
summarise_all(~(sum(is.na(.)))) %>%
gather(key = "Column", value = "NA_Count", -1) %>%
filter(NA_Count > 0) %>%
arrange(desc(NA_Count)) %>%
select(-Id)
library(naniar)
vis_miss(ames[c(2:40)],cluster = TRUE, sort_miss =TRUE)
vis_miss(ames[c(41:81)],cluster = TRUE, sort_miss = TRUE)
| Column | NA_Count |
|---|---|
| <chr> | <int> |
| PoolQC | 2909 |
| MiscFeature | 2814 |
| Alley | 2721 |
| Fence | 2348 |
| SalePrice | 1459 |
| FireplaceQu | 1420 |
| LotFrontage | 486 |
| GarageYrBlt | 159 |
| GarageFinish | 159 |
| GarageQual | 159 |
| GarageCond | 159 |
| GarageType | 157 |
| BsmtCond | 82 |
| BsmtExposure | 82 |
| BsmtQual | 81 |
| BsmtFinType2 | 80 |
| BsmtFinType1 | 79 |
| MasVnrType | 24 |
| MasVnrArea | 23 |
| MSZoning | 4 |
| Utilities | 2 |
| BsmtFullBath | 2 |
| BsmtHalfBath | 2 |
| Functional | 2 |
| Exterior1st | 1 |
| Exterior2nd | 1 |
| BsmtFinSF1 | 1 |
| BsmtFinSF2 | 1 |
| BsmtUnfSF | 1 |
| TotalBsmtSF | 1 |
| Electrical | 1 |
| KitchenQual | 1 |
| GarageCars | 1 |
| GarageArea | 1 |
| SaleType | 1 |
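Most of the NA's in the data set appear to reflect the absence of a feature rather than a value that failed to be recorded. For example, if a house does not have a pool, then the PoolQC column will be NA. Only a handful of columns (for example MSZoning, KitchenQual, and SaleType) have a few NA's that do not obviously correspond to a missing feature.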
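One way to check this reading is to cross-tabulate a quality column's missingness against the matching size column. The sketch below uses a small made-up data frame (the values are assumptions); on the real data the same calls could be pointed at PoolQC and PoolArea in `ames`.

```r
# Hypothetical rows: PoolQC is NA exactly where PoolArea is 0
toy <- data.frame(
  PoolQC   = c(NA, "Gd", NA, NA, "Ex"),
  PoolArea = c(0, 512, 0, 0, 648)
)

# Rows: is PoolQC missing? Columns: does the house have a pool area > 0?
na_vs_pool <- table(
  qc_missing = is.na(toy$PoolQC),
  has_pool   = toy$PoolArea > 0
)
print(na_vs_pool)

# Number of rows where the NA is not explained by "no pool"
unexplained <- sum(is.na(toy$PoolQC) & toy$PoolArea > 0)
print(unexplained)
```

A nonzero `unexplained` count on the real data would flag rows that deserve a closer look before being set to "None".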
# Imputation
# If pool-related variables are NA, assume there is no pool and assign to 0
ames <- ames %>%
mutate(
PoolQC = ifelse(is.na(PoolQC), "None", PoolQC),
PoolArea = ifelse(is.na(PoolArea), 0, PoolArea),
)
# If garage-related variables are NA, assume there is no garage: "None" for categorical columns, 0 for counts and areas
ames <- ames %>%
mutate(
GarageType = ifelse(is.na(GarageType), "None", GarageType),
GarageYrBlt = ifelse(is.na(GarageYrBlt), 1979, GarageYrBlt), # 1979 is the median GarageYrBlt; a 0 would distort a year-valued column
GarageFinish = ifelse(is.na(GarageFinish), "None", GarageFinish),
GarageCars = ifelse(is.na(GarageCars), 0, GarageCars),
GarageArea = ifelse(is.na(GarageArea), 0, GarageArea),
GarageQual = ifelse(is.na(GarageQual), "None", GarageQual),
GarageCond = ifelse(is.na(GarageCond), "None", GarageCond)
)
# If basement-related variables are NA, assume there is no basement ("None" or 0). Also set LotFrontage and MasVnrArea to 0, MasVnrType to "None", Utilities to "AllPub" (the most common value), Exterior1st/Exterior2nd to "Other", and Electrical to "FuseA"
ames <- ames %>%
mutate(
BsmtQual = ifelse(is.na(BsmtQual), "None", BsmtQual),
BsmtCond = ifelse(is.na(BsmtCond), "None", BsmtCond),
BsmtExposure = ifelse(is.na(BsmtExposure), "None", BsmtExposure),
BsmtFinType1 = ifelse(is.na(BsmtFinType1), "None", BsmtFinType1),
BsmtFinSF1 = ifelse(is.na(BsmtFinSF1), 0, BsmtFinSF1),
BsmtFinType2 = ifelse(is.na(BsmtFinType2), "None", BsmtFinType2),
BsmtFinSF2 = ifelse(is.na(BsmtFinSF2), 0, BsmtFinSF2),
BsmtUnfSF = ifelse(is.na(BsmtUnfSF), 0, BsmtUnfSF),
BsmtFullBath = ifelse(is.na(BsmtFullBath), 0, BsmtFullBath),
BsmtHalfBath = ifelse(is.na(BsmtHalfBath), 0, BsmtHalfBath),
TotalBsmtSF = ifelse(is.na(TotalBsmtSF), 0, TotalBsmtSF),
LotFrontage = ifelse(is.na(LotFrontage), 0, LotFrontage),
MasVnrArea = ifelse(is.na(MasVnrArea), 0, MasVnrArea),
MasVnrType = ifelse(is.na(MasVnrType), "None", MasVnrType),
Utilities = ifelse(is.na(Utilities), "AllPub", Utilities),
Exterior1st = ifelse(is.na(Exterior1st), "Other", Exterior1st),
Exterior2nd = ifelse(is.na(Exterior2nd), "Other", Exterior2nd),
Electrical = ifelse(is.na(Electrical), "FuseA", Electrical),
)
# If Fence is NA, assume there is no fence and assign "None"
ames <- ames %>%
mutate(
Fence = ifelse(is.na(Fence), "None", Fence),
)
# If MiscFeature is NA, assume there are no miscellaneous features and assign "None"
ames <- ames %>%
mutate(
MiscFeature = ifelse(is.na(MiscFeature), "None", MiscFeature),
)
# If FireplaceQu is NA, assume there is no fireplace and assign "None"
ames <- ames %>%
mutate(
FireplaceQu = ifelse(is.na(FireplaceQu), "None", FireplaceQu),
)
# If Alley is NA, assume there is no alley access and assign "None"
ames <- ames %>%
mutate(
Alley = ifelse(is.na(Alley), "None", Alley),
)
# Summarize the amount of remaining NA's by column to check what's left
colSums(is.na(ames))
# create a dataset for eda named ameseda
ameseda <- ames[ames$train == 1, ]
# Use the dummyVars() function to convert categorical variables into dummy variables
# Then use janitor::clean_names() to clean up the column names
dummy_model <- dummyVars(~ ., data = ames)
ames_dummy <- as.data.frame(predict(dummy_model, newdata = ames))
ames_dummy <- clean_names(ames_dummy)
# NOTE: Probably could make the case for deleting NAs here -Nick
# Fill in all remaining na values with the mean of the column
ames_dummy <- ames_dummy %>%
mutate(across(
-sale_price,
~ ifelse(is.na(.), mean(., na.rm = TRUE), .)
))
# create ames dataset for modeling to be consistent with team member's terminology
ames <-ames_dummy
# Summary of missing values post imputation and changing into dummy. Sales Price from the 'test' dataset is the only column with missing values.
gg_miss_var(ames_dummy[,c(1:50)])
gg_miss_var(ames_dummy[,c(51:100)])
gg_miss_var(ames_dummy[,c(101:150)])
gg_miss_var(ames_dummy[,c(151:200)])
gg_miss_var(ames_dummy[,c(201:250)])
gg_miss_var(ames_dummy[,c(251:305)])
vis_miss(ameseda[c(2:40)],cluster = TRUE, sort_miss =TRUE)
vis_miss(ameseda[c(41:81)],cluster = TRUE, sort_miss = TRUE)
| Column | NA_Count |
|---|---|
| Id | 0 |
| MSSubClass | 0 |
| MSZoning | 4 |
| LotFrontage | 0 |
| LotArea | 0 |
| Street | 0 |
| Alley | 0 |
| LotShape | 0 |
| LandContour | 0 |
| Utilities | 0 |
| LotConfig | 0 |
| LandSlope | 0 |
| Neighborhood | 0 |
| Condition1 | 0 |
| Condition2 | 0 |
| BldgType | 0 |
| HouseStyle | 0 |
| OverallQual | 0 |
| OverallCond | 0 |
| YearBuilt | 0 |
| YearRemodAdd | 0 |
| RoofStyle | 0 |
| RoofMatl | 0 |
| Exterior1st | 0 |
| Exterior2nd | 0 |
| MasVnrType | 0 |
| MasVnrArea | 0 |
| ExterQual | 0 |
| ExterCond | 0 |
| Foundation | 0 |
| BsmtQual | 0 |
| BsmtCond | 0 |
| BsmtExposure | 0 |
| BsmtFinType1 | 0 |
| BsmtFinSF1 | 0 |
| BsmtFinType2 | 0 |
| BsmtFinSF2 | 0 |
| BsmtUnfSF | 0 |
| TotalBsmtSF | 0 |
| Heating | 0 |
| HeatingQC | 0 |
| CentralAir | 0 |
| Electrical | 0 |
| X1stFlrSF | 0 |
| X2ndFlrSF | 0 |
| LowQualFinSF | 0 |
| GrLivArea | 0 |
| BsmtFullBath | 0 |
| BsmtHalfBath | 0 |
| FullBath | 0 |
| HalfBath | 0 |
| BedroomAbvGr | 0 |
| KitchenAbvGr | 0 |
| KitchenQual | 1 |
| TotRmsAbvGrd | 0 |
| Functional | 2 |
| Fireplaces | 0 |
| FireplaceQu | 0 |
| GarageType | 0 |
| GarageYrBlt | 0 |
| GarageFinish | 0 |
| GarageCars | 0 |
| GarageArea | 0 |
| GarageQual | 0 |
| GarageCond | 0 |
| PavedDrive | 0 |
| WoodDeckSF | 0 |
| OpenPorchSF | 0 |
| EnclosedPorch | 0 |
| X3SsnPorch | 0 |
| ScreenPorch | 0 |
| PoolArea | 0 |
| PoolQC | 0 |
| Fence | 0 |
| MiscFeature | 0 |
| MiscVal | 0 |
| MoSold | 0 |
| YrSold | 0 |
| SaleType | 1 |
| SaleCondition | 0 |
| SalePrice | 1459 |
| train | 0 |
Imputation:
Pool-related variables: Upon investigation, we found that the missing data for pool-related variables followed an MNAR (Missing Not at Random) pattern, occurring specifically in homes without pools. To address this, we replaced the missing values with "None" or 0, depending on the variable type.
Garage-related variables: The missing data for garage-related variables also followed an MNAR pattern, occurring in homes without garages. We imputed the missing values with "None" or 0, depending on the variable type. The exception is GarageYrBlt, which we set to 1979 (the median garage year), because a 0 would be far outside the range of a year-valued column.
Basement-related variables: As with the pool and garage variables, the missing data for basement-related variables displayed an MNAR pattern, primarily in homes without basements. We replaced the missing values with "None" or 0, based on the variable type.
Additionally, categorical variables such as Fence, FireplaceQu, and Alley, which exhibited an MNAR pattern, were assigned the value "None".
For variables that followed an MCAR (Missing Completely at Random) pattern and had a relatively low number of missing values, we imputed the missing values with the mean of the variable.
The datasets after imputation and processing were split back to 'test' and 'train'.
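The last two steps (mean imputation of the remaining gaps, then splitting on the 0/1 flag) can be sketched as follows. This is a minimal base R illustration on made-up rows, not the notebook's actual pipeline; the column names mirror the cleaned names used here (`sale_price`, `train`), and `lot_frontage` is just an example predictor.

```r
# Hypothetical combined data with the 0/1 train flag
combined <- data.frame(
  lot_frontage = c(65, NA, 68, 60, NA),
  sale_price   = c(208500, 181500, NA, NA, 250000),
  train        = c(1, 1, 0, 0, 1)
)

# Mean-impute every predictor column, leaving the response untouched
predictors <- setdiff(names(combined), c("sale_price", "train"))
for (col in predictors) {
  missing <- is.na(combined[[col]])
  combined[[col]][missing] <- mean(combined[[col]], na.rm = TRUE)
}

# Split back on the flag; the test rows carry no response
train_set <- subset(combined, train == 1, select = -train)
test_set  <- subset(combined, train == 0, select = -c(train, sale_price))

print(train_set)
print(test_set)
```

Excluding `sale_price` from the imputation matters: filling the test rows' missing response with a mean would silently turn them into fake training labels.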
Influential points in training data:
ames[524, ] %>% select(sale_price, gr_liv_area)
ames[1299, ] %>% select(sale_price, gr_liv_area)
# Remove the two outliers
ames <- ames[-c(524, 1299), ]
| | sale_price | gr_liv_area |
|---|---|---|
| | <dbl> | <dbl> |
| 524 | 184750 | 4676 |
| 1299 | 160000 | 5642 |
Observations 524 and 1299 were identified as outliers based on their GrLivArea and SalePrice values: both have very large living areas (4676 and 5642 square feet) but comparatively low sale prices. These observations were removed from the training dataset because they are so atypical. We assume that some other factor, not captured in the data, explains these values.
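The pattern behind these two rows can also be screened for programmatically. The sketch below is only an illustration: the data frame is a small stand-in that includes the two values printed above, and the 4000 square foot and $200,000 cutoffs are assumptions chosen for the example, not thresholds used in this analysis.

```r
# Small stand-in data frame; rows 3 and 5 carry the two flagged values
toy <- data.frame(
  gr_liv_area = c(1710, 1262, 4676, 2198, 5642),
  sale_price  = c(208500, 181500, 184750, 250000, 160000)
)

# Illustrative cutoffs (assumptions)
area_cutoff  <- 4000
price_cutoff <- 200000

# Logical mask: very large living area combined with a low price
is_outlier <- toy$gr_liv_area > area_cutoff & toy$sale_price < price_cutoff
print(toy[is_outlier, ])

# Keep everything else; a logical mask stays correct even if nothing is flagged
toy_clean <- toy[!is_outlier, ]
```

Filtering with a logical mask avoids a pitfall of negative indexing: `toy[-which(is_outlier), ]` returns zero rows when no row is flagged.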
Exploratory Data Analysis
Next, we explore the Ames housing data to uncover key patterns, trends, and relationships that will help us build robust models.
Numerical Data Analysis I
#ameseda_n is used for eda analysis on all numeric variables
ameseda_n <- ameseda %>%
select_if(function(x) is.numeric(x) || is.integer(x))
#library(gridExtra)
# Reshape to long format (one row per variable/value pair) for ggplot
ames_long <- ameseda_n %>%
pivot_longer(everything(), names_to = "variable", values_to = "value")
# Set the plot size and aspect ratio
options(repr.plot.width = 10, repr.plot.height = 6)
# Divide the variables into 4 groups
# Group 1
group1 <- c( "MSSubClass", "LotFrontage", "LotArea", "OverallQual", "OverallCond", "YearBuilt", "YearRemodAdd")
# Group 2
group2 <- c("MasVnrArea", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "X1stFlrSF", "X2ndFlrSF", "LowQualFinSF")
# Group 3
group3 <- c("GrLivArea", "BsmtFullBath", "BsmtHalfBath", "FullBath", "HalfBath", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd")
# Group 4
group4 <- c("Fireplaces", "GarageYrBlt", "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
"X3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal", "MoSold", "YrSold", "SalePrice")
# Create plots for each group of variables
plot1 <- ames_long %>%
filter(variable %in% group1) %>%
ggplot(aes(x = variable, y = value)) +
geom_boxplot() +
facet_wrap(~variable, scales = "free") +
theme(axis.text.x = element_blank()) +
labs(title = "Boxplots - Group 1", x = "Variables", y = "Values")
plot2 <- ames_long %>%
filter(variable %in% group2) %>%
ggplot(aes(x = variable, y = value)) +
geom_boxplot() +
facet_wrap(~variable, scales = "free") +
theme(axis.text.x = element_blank()) +
labs(title = "Boxplots - Group 2", x = "Variables", y = "Values")
plot3 <- ames_long %>%
filter(variable %in% group3) %>%
ggplot(aes(x = variable, y = value)) +
geom_boxplot() +
facet_wrap(~variable, scales = "free") +
theme(axis.text.x = element_blank()) +
labs(title = "Boxplots - Group 3", x = "Variables", y = "Values")
plot4 <- ames_long %>%
filter(variable %in% group4) %>%
ggplot(aes(x = variable, y = value)) +
geom_boxplot() +
facet_wrap(~variable, scales = "free") +
theme(axis.text.x = element_blank()) +
labs(title = "Boxplots - Group 4", x = "Variables", y = "Values")
# Summary table on all numeric variables from dataset
library(psych)
describe(ameseda_n)
# Display plots
plot1
plot2
plot3
plot4
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| Id | 1 | 1460 | 7.305000e+02 | 4.216100e+02 | 730.5 | 7.305000e+02 | 541.1490 | 1 | 1460 | 1459 | 0.00000000 | -1.20246603 | 1.103404e+01 |
| MSSubClass | 2 | 1460 | 5.689726e+01 | 4.230057e+01 | 50.0 | 4.915240e+01 | 44.4780 | 20 | 190 | 170 | 1.40476562 | 1.56441572 | 1.107057e+00 |
| LotFrontage | 3 | 1460 | 5.762329e+01 | 3.466430e+01 | 63.0 | 5.793921e+01 | 25.2042 | 0 | 313 | 313 | 0.26727232 | 3.58518810 | 9.072063e-01 |
| LotArea | 4 | 1460 | 1.051683e+04 | 9.981265e+03 | 9478.5 | 9.563284e+03 | 2962.2348 | 1300 | 215245 | 213945 | 12.18261502 | 202.26232234 | 2.612216e+02 |
| OverallQual | 5 | 1460 | 6.099315e+00 | 1.382997e+00 | 6.0 | 6.079623e+00 | 1.4826 | 1 | 10 | 9 | 0.21649836 | 0.08762258 | 3.619467e-02 |
| OverallCond | 6 | 1460 | 5.575342e+00 | 1.112799e+00 | 5.0 | 5.477740e+00 | 0.0000 | 1 | 9 | 8 | 0.69164401 | 1.09290874 | 2.912329e-02 |
| YearBuilt | 7 | 1460 | 1.971268e+03 | 3.020290e+01 | 1973.0 | 1.974127e+03 | 37.0650 | 1872 | 2010 | 138 | -0.61220121 | -0.44565754 | 7.904461e-01 |
| YearRemodAdd | 8 | 1460 | 1.984866e+03 | 2.064541e+01 | 1994.0 | 1.986369e+03 | 19.2738 | 1950 | 2010 | 60 | -0.50252776 | -1.27436545 | 5.403150e-01 |
| MasVnrArea | 9 | 1460 | 1.031171e+02 | 1.807314e+02 | 0.0 | 6.254110e+01 | 0.0000 | 0 | 1600 | 1600 | 2.67211701 | 10.08466917 | 4.729956e+00 |
| BsmtFinSF1 | 10 | 1460 | 4.436397e+02 | 4.560981e+02 | 383.5 | 3.860762e+02 | 568.5771 | 0 | 5644 | 5644 | 1.68204129 | 11.05681415 | 1.193663e+01 |
| BsmtFinSF2 | 11 | 1460 | 4.654932e+01 | 1.613193e+02 | 0.0 | 1.382705e+00 | 0.0000 | 0 | 1474 | 1474 | 4.24652141 | 20.00886409 | 4.221918e+00 |
| BsmtUnfSF | 12 | 1460 | 5.672404e+02 | 4.418670e+02 | 477.5 | 5.192885e+02 | 426.9888 | 0 | 2336 | 2336 | 0.91837835 | 0.46451129 | 1.156419e+01 |
| TotalBsmtSF | 13 | 1460 | 1.057429e+03 | 4.387053e+02 | 991.5 | 1.036695e+03 | 347.6697 | 0 | 6110 | 6110 | 1.52112395 | 13.17885602 | 1.148144e+01 |
| X1stFlrSF | 14 | 1460 | 1.162627e+03 | 3.865877e+02 | 1087.0 | 1.129991e+03 | 347.6697 | 334 | 4692 | 4358 | 1.37392896 | 5.71013207 | 1.011746e+01 |
| X2ndFlrSF | 15 | 1460 | 3.469925e+02 | 4.365284e+02 | 0.0 | 2.853639e+02 | 0.0000 | 0 | 2065 | 2065 | 0.81135997 | -0.55902397 | 1.142447e+01 |
| LowQualFinSF | 16 | 1460 | 5.844521e+00 | 4.862308e+01 | 0.0 | 0.000000e+00 | 0.0000 | 0 | 572 | 572 | 8.99283329 | 82.82823852 | 1.272524e+00 |
| GrLivArea | 17 | 1460 | 1.515464e+03 | 5.254804e+02 | 1464.0 | 1.467670e+03 | 483.3276 | 334 | 5642 | 5308 | 1.36375364 | 4.86348279 | 1.375245e+01 |
| BsmtFullBath | 18 | 1460 | 4.253425e-01 | 5.189106e-01 | 0.0 | 3.921233e-01 | 0.0000 | 0 | 3 | 3 | 0.59484237 | -0.84329160 | 1.358051e-02 |
| BsmtHalfBath | 19 | 1460 | 5.753425e-02 | 2.387526e-01 | 0.0 | 0.000000e+00 | 0.0000 | 0 | 2 | 2 | 4.09497490 | 16.30995691 | 6.248442e-03 |
| FullBath | 20 | 1460 | 1.565068e+00 | 5.509158e-01 | 2.0 | 1.560788e+00 | 0.0000 | 0 | 3 | 3 | 0.03648647 | -0.86115028 | 1.441813e-02 |
| HalfBath | 21 | 1460 | 3.828767e-01 | 5.028854e-01 | 0.0 | 3.433219e-01 | 0.0000 | 0 | 2 | 2 | 0.67450925 | -1.07998235 | 1.316111e-02 |
| BedroomAbvGr | 22 | 1460 | 2.866438e+00 | 8.157780e-01 | 3.0 | 2.852740e+00 | 0.0000 | 0 | 8 | 8 | 0.21135511 | 2.21198810 | 2.134989e-02 |
| KitchenAbvGr | 23 | 1460 | 1.046575e+00 | 2.203382e-01 | 1.0 | 1.000000e+00 | 0.0000 | 0 | 3 | 3 | 4.47917826 | 21.42113861 | 5.766514e-03 |
| TotRmsAbvGrd | 24 | 1460 | 6.517808e+00 | 1.625393e+00 | 6.0 | 6.408390e+00 | 1.4826 | 2 | 14 | 12 | 0.67495173 | 0.86833683 | 4.253849e-02 |
| Fireplaces | 25 | 1460 | 6.130137e-01 | 6.446664e-01 | 1.0 | 5.342466e-01 | 1.4826 | 0 | 3 | 3 | 0.64823107 | -0.22440683 | 1.687169e-02 |
| GarageYrBlt | 26 | 1460 | 1.978534e+03 | 2.399485e+01 | 1979.0 | 1.980998e+03 | 29.6520 | 1900 | 2010 | 110 | -0.67020290 | -0.27050466 | 6.279739e-01 |
| GarageCars | 27 | 1460 | 1.767123e+00 | 7.473150e-01 | 2.0 | 1.773973e+00 | 0.0000 | 0 | 4 | 4 | -0.34184538 | 0.21173072 | 1.955813e-02 |
| GarageArea | 28 | 1460 | 4.729801e+02 | 2.138048e+02 | 480.0 | 4.698082e+02 | 177.9120 | 0 | 1418 | 1418 | 0.17961125 | 0.90446871 | 5.595528e+00 |
| WoodDeckSF | 29 | 1460 | 9.424452e+01 | 1.253388e+02 | 0.0 | 7.175771e+01 | 0.0000 | 0 | 857 | 857 | 1.53820999 | 2.97041708 | 3.280266e+00 |
| OpenPorchSF | 30 | 1460 | 4.666027e+01 | 6.625603e+01 | 25.0 | 3.323288e+01 | 37.0650 | 0 | 547 | 547 | 2.35948572 | 8.44149101 | 1.733999e+00 |
| EnclosedPorch | 31 | 1460 | 2.195411e+01 | 6.111915e+01 | 0.0 | 3.866438e+00 | 0.0000 | 0 | 552 | 552 | 3.08352575 | 10.37263409 | 1.599561e+00 |
| X3SsnPorch | 32 | 1460 | 3.409589e+00 | 2.931733e+01 | 0.0 | 0.000000e+00 | 0.0000 | 0 | 508 | 508 | 10.28317840 | 123.06231159 | 7.672696e-01 |
| ScreenPorch | 33 | 1460 | 1.506096e+01 | 5.575742e+01 | 0.0 | 0.000000e+00 | 0.0000 | 0 | 480 | 480 | 4.11374731 | 18.34260759 | 1.459238e+00 |
| PoolArea | 34 | 1460 | 2.758904e+00 | 4.017731e+01 | 0.0 | 0.000000e+00 | 0.0000 | 0 | 738 | 738 | 14.79791829 | 222.19170782 | 1.051488e+00 |
| MiscVal | 35 | 1460 | 4.348904e+01 | 4.961230e+02 | 0.0 | 0.000000e+00 | 0.0000 | 0 | 15500 | 15500 | 24.42652237 | 697.64007214 | 1.298413e+01 |
| MoSold | 36 | 1460 | 6.321918e+00 | 2.703626e+00 | 6.0 | 6.252568e+00 | 2.9652 | 1 | 12 | 11 | 0.21161746 | -0.41038457 | 7.075713e-02 |
| YrSold | 37 | 1460 | 2.007816e+03 | 1.328095e+00 | 2008.0 | 2.007770e+03 | 1.4826 | 2006 | 2010 | 4 | 0.09607079 | -1.19311159 | 3.475784e-02 |
| SalePrice | 38 | 1460 | 1.809212e+05 | 7.944250e+04 | 163000.0 | 1.707833e+05 | 56338.8000 | 34900 | 755000 | 720100 | 1.87900860 | 6.49678933 | 2.079105e+03 |
| train | 39 | 1460 | 1.000000e+00 | 0.000000e+00 | 1.0 | 1.000000e+00 | 0.0000 | 1 | 1 | 0 | NaN | NaN | 0.000000e+00 |
Upon analyzing the boxplots and summary table, we observe that the majority of the numerical variables exhibit a right-skewed distribution. However, a few variables, namely YearBuilt, YearRemodAdd, GarageYrBlt, and GarageCars, display a left-skewed distribution. Our response variable, SalePrice, also demonstrates a right-skewed distribution and reveals the presence of outliers.
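For readers who want to see what the sign of the skew column means, here is a minimal sketch of a moment-based skewness calculation on made-up vectors. The helper name `sample_skewness` is hypothetical, and `psych::describe()` may apply a slightly different normalization, so its values need not match this formula exactly.

```r
# Moment-based sample skewness: mean cubed deviation over the cubed
# (population-style) standard deviation
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2))^1.5
}

right_tailed <- c(1, 2, 2, 3, 3, 3, 20)  # one large value pulls the tail right
left_tailed  <- -right_tailed            # mirror image

print(sample_skewness(right_tailed))  # positive for a right-skewed vector
print(sample_skewness(left_tailed))   # negative for a left-skewed vector
```

A positive value corresponds to the long right tail seen for SalePrice; a negative value corresponds to the year-type variables, where most homes cluster toward recent years.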
Categorical Data Analysis I
# Create a dataset containing only the categorical (character) variables
ameseda_c <- ameseda %>%
  select(where(is.character))
# Convert all categorical variables to factors
ameseda_c <- ameseda_c %>% mutate(across(everything(), as.factor))
# Attach the response so it can be plotted against each categorical variable
ameseda_c <- ameseda_c %>%
  mutate(SalePrice = ameseda_n$SalePrice)
dataset <- ameseda_c
# Name of the response column, used when building the plots below
response_variable <- "SalePrice"
# Categorical variables to plot against SalePrice
categorical_variables <- c("MSZoning", "Street", "Alley", "LotShape", "LandContour", "Utilities", "LotConfig", "LandSlope",
"Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle", "RoofStyle", "RoofMatl",
"Exterior1st", "Exterior2nd", "MasVnrType", "ExterQual", "ExterCond", "Foundation", "BsmtQual",
"BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "Heating", "HeatingQC", "CentralAir",
"Electrical", "KitchenQual", "Functional", "FireplaceQu", "GarageType", "GarageFinish",
"GarageQual", "GarageCond", "PavedDrive", "PoolQC", "Fence", "MiscFeature", "SaleType",
"SaleCondition")
# Build one histogram of SalePrice per categorical variable, filled by its levels
# (.data[[...]] replaces the deprecated aes_string())
plots <- list()
for (variable in categorical_variables) {
  plot <- ggplot(dataset, aes(x = .data[[response_variable]], fill = .data[[variable]])) +
    geom_histogram(color = "black", bins = 30) +
    labs(title = paste("Histogram of", response_variable, "-", variable),
         x = response_variable, fill = variable) +
    theme_bw()
  plots[[variable]] <- plot
}
# Display the plots
for (variable in categorical_variables) {
  print(plots[[variable]])
}
# Plot SalePrice against the levels of each categorical variable
for (variable in categorical_variables) {
  plot <- ggplot(dataset, aes(x = .data[[response_variable]], y = .data[[variable]],
                              color = .data[[variable]])) +
    geom_point() +
    labs(title = paste("Scatter Plot of", response_variable, "vs", variable),
         x = response_variable, y = variable, color = variable) +
    theme_bw()
  print(plot)
}
# added this to summarize
#library(psych)
#describe(ameseda_n)
Based on the histograms and scatter plots of SalePrice split by each categorical variable, the following variables show visibly different SalePrice distributions across their levels and are therefore candidates for modeling SalePrice: MSZoning, RoofStyle, Exterior1st, Exterior2nd, LotShape, LandContour, LotConfig, Neighborhood, BldgType, HouseStyle, HeatingQC, CentralAir, KitchenQual, FireplaceQu, GarageType, GarageFinish, PavedDrive, SaleType, SaleCondition, Condition1, MasVnrType, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtFinType1, Electrical, Functional, GarageQual, GarageCond.
Conversely, the following variables are less likely to be useful in modeling SalePrice: Street, Alley, Utilities, LandSlope, Condition2, RoofMatl, PoolQC, Fence, MiscFeature, BsmtExposure, BsmtFinType2. In these plots they show no clear relationship with SalePrice.
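The visual screening above could be complemented with a simple numeric score per categorical variable. The sketch below is one possible approach under stated assumptions, not the method used in this analysis: it uses simulated data (the `toy` data frame and its columns are hypothetical) and scores each factor by the R-squared of a one-factor linear model, which equals the share of response variance explained by that factor.

```r
# Simulated data: one informative factor and one dominated by a single level
# (hypothetical columns, not the Ames data)
set.seed(42)
n <- 300
toy <- data.frame(
  informative = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  dominated   = factor(c(rep("Pave", n - 2), rep("Grvl", 2)))
)
toy$price <- 100 + 50 * (toy$informative == "B") +
  120 * (toy$informative == "C") + rnorm(n, sd = 20)

# R-squared of a one-factor model = share of variance explained by that factor
r2 <- sapply(c("informative", "dominated"), function(v) {
  summary(lm(toy$price ~ toy[[v]]))$r.squared
})
print(round(r2, 3))
```

A factor whose levels separate the response well gets a score near 1, while a factor with almost all observations in one level scores near 0.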
Numerical Data Analysis II
# create correlation plot for the numerical variables
library(corrplot)
corrplot(cor(ameseda_n),tl.cex = 0.6)
# Pairs plots based on the correlation plot. Not every numerical variable is plotted;
# we chose the ones that were highly correlated with SalePrice in the correlation plot
library(GGally)
library(dplyr)
# Lower-panel helper for ggpairs(): scatter plot with a smoother (loess by default)
lowerFn <- function(data, mapping, method = "loess", ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(colour = "blue", size = .2) +
    geom_smooth(method = method, color = "red", ...)
}
# First plot with selected variables
ameseda_n %>%
select(SalePrice, OverallQual, LotArea, YearBuilt, GrLivArea) %>%
ggpairs(lower = list(continuous = lowerFn))
# Second plot with selected variables
ameseda_n %>%
select(SalePrice, YearRemodAdd, TotalBsmtSF, X1stFlrSF, LowQualFinSF) %>%
ggpairs(lower = list(continuous = lowerFn))
# Third plot with selected variables
ameseda_n %>%
select(SalePrice, FullBath, TotRmsAbvGrd, Fireplaces, GarageYrBlt, GarageCars, GarageArea) %>%
ggpairs(lower = list(continuous = lowerFn))
library(car)
vif(lm(SalePrice ~ OverallQual + YearBuilt + GrLivArea+YearRemodAdd+ X1stFlrSF+TotalBsmtSF+FullBath+TotRmsAbvGrd+GarageCars+Fireplaces+GarageYrBlt + GarageArea + LotArea, data=ameseda_n))
Warning message in cor(ameseda_n): "the standard deviation is zero"
`geom_smooth()` using formula = 'y ~ x'
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, : "pseudoinverse used at -0.015"
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, : "neighborhood radius 2.015"
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, : "reciprocal condition number 1.9766e-15"
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, : "There are other near singularities as well. 4.0602"
| Variable | VIF |
| --- | --- |
| OverallQual | 2.86660642324521 |
| YearBuilt | 3.39581864945437 |
| GrLivArea | 5.30810776907409 |
| YearRemodAdd | 1.89980266530303 |
| X1stFlrSF | 3.78272037109803 |
| TotalBsmtSF | 3.63026355194396 |
| FullBath | 2.27318886436909 |
| TotRmsAbvGrd | 3.37884959744182 |
| GarageCars | 5.38364319877975 |
| Fireplaces | 1.47442743614239 |
| GarageYrBlt | 3.08132167667782 |
| GarageArea | 5.18777544286162 |
| LotArea | 1.17441230848542 |
Based on the correlation plot and the pairs plots, SalePrice is related to several predictor variables. In most cases the relationship appears quadratic rather than linear, which could be driven by the small number of very high SalePrice values. SalePrice is most strongly correlated with OverallQual (0.791), GrLivArea (0.709), GarageCars (0.640), GarageArea (0.623), TotalBsmtSF (0.614), X1stFlrSF (0.606), FullBath (0.561), TotRmsAbvGrd (0.534), YearBuilt (0.523), and YearRemodAdd (0.507); GarageYrBlt shows a milder correlation.
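A ranking like the one above can be read directly from the correlation matrix. The sketch below uses simulated data (the `toy` data frame and its columns are hypothetical) and drops constant columns before calling `cor()`, since a zero-variance column such as the `train` indicator in the summary table yields NA correlations and the "standard deviation is zero" warning.

```r
# Simulated numeric data with one constant column (hypothetical, not Ames data)
set.seed(7)
n <- 200
toy <- data.frame(
  quality = sample(1:10, n, replace = TRUE),
  area    = rnorm(n, 1500, 400),
  train   = 1  # constant indicator column
)
toy$price <- 20000 * toy$quality + 60 * toy$area + rnorm(n, sd = 30000)

# Keep only columns that vary; cor() cannot handle constants
varies <- sapply(toy, function(col) sd(col) > 0)
cors <- cor(toy[, varies])[, "price"]

# Rank predictors by their correlation with the response
ranked <- sort(cors[names(cors) != "price"], decreasing = TRUE)
print(round(ranked, 3))
```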
LotArea has a correlation coefficient of 0.264 with SalePrice. This value may be influenced by a few houses with unusually large lots, since such outliers can inflate or deflate a correlation coefficient considerably.
The correlation between LotArea and SalePrice should therefore be interpreted with caution. Examining the outliers and their influence would give a more accurate picture of how lot area relates to sale price.
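One way to gauge how much a handful of extreme values affect a correlation is to compare Pearson's coefficient with the rank-based Spearman coefficient. The sketch below is illustrative only: the data are simulated, all variable names are hypothetical, and in this particular setup the outliers happen to deflate the Pearson correlation.

```r
# Simulated lot areas and prices with a clear positive relationship
# (hypothetical values, not the Ames data)
set.seed(3)
n <- 200
lot_area <- rlnorm(n, meanlog = 9, sdlog = 0.4)
price <- 50000 + 12 * lot_area + rnorm(n, sd = 20000)

# Append three very large lots that sold for ordinary prices
lot_area_out <- c(lot_area, 150000, 180000, 215000)
price_out    <- c(price, 160000, 200000, 180000)

pearson_clean <- cor(lot_area, price)
pearson_out   <- cor(lot_area_out, price_out)
spearman_out  <- cor(lot_area_out, price_out, method = "spearman")

print(round(c(pearson_clean = pearson_clean,
              pearson_out   = pearson_out,
              spearman_out  = spearman_out), 3))
```

Because Spearman's coefficient depends only on ranks, three extreme lots barely move it, whereas the Pearson coefficient changes substantially.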
To assess multicollinearity among these correlated predictors, we computed variance inflation factors (VIFs). GrLivArea (5.31), GarageCars (5.38), and GarageArea (5.19) are borderline, with VIFs just above 5; all other predictors have VIFs below 4.
These correlations and collinearity issues should be kept in mind when modeling SalePrice. Further analysis and modeling techniques may be necessary to address the quadratic relationships and potential collinearity effects in order to build an accurate predictive model.
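For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it from this definition using the built-in `mtcars` data rather than the Ames data; `vif_by_hand()` is a hypothetical helper, and for purely numeric predictors its values should agree with the usual VIF.

```r
# VIF from its definition: regress each predictor on the others
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # R-squared of predictor p explained by the remaining predictors
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Example on the built-in mtcars data
vifs <- vif_by_hand(mtcars, c("disp", "hp", "wt", "drat"))
print(round(vifs, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others; values around 5 or more are commonly treated as a sign of notable collinearity.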
Categorical Data Analysis II
# Fit a linear model with categorical variables to validate our visual findings
# Use the dummyVars() function to convert categorical variables into dummy variables
# Then use janitor::clean_names() to clean up the column names
dummy_model <- dummyVars(~ ., data = ameseda_c)
ames_dummy <- as.data.frame(predict(dummy_model, newdata = ameseda_c))
ames_dummy <- clean_names(ames_dummy)
options(max.print = 2000)
cbind(ameseda_n$SalePrice, ames_dummy) %>%
  lm(ameseda_n$SalePrice ~ ., data = .) %>%
  summary()
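Note that in the output of this call the term `ameseda_n$SalePrice` has a coefficient of 1 and the residuals are on the order of 1e-9, which suggests that the response is also present among the predictors, so the remaining coefficients are not informative. The sketch below reproduces the effect on simulated data and shows one possible way to avoid it; the `toy` data frame and its columns are hypothetical, and base R's `lm()` is used directly on factors instead of dummy variables.

```r
# Simulated data with two categorical predictors (hypothetical, not Ames data)
set.seed(11)
n <- 150
toy <- data.frame(
  zoning  = factor(sample(c("RL", "RM", "FV"), n, replace = TRUE)),
  central = factor(sample(c("Y", "N"), n, replace = TRUE))
)
toy$SalePrice <- 150000 + 40000 * (toy$zoning == "FV") -
  25000 * (toy$central == "N") + rnorm(n, sd = 15000)

# Leaky fit: the response is written as toy$SalePrice, so "." still expands
# to every column of the data, including SalePrice itself
leaky <- lm(toy$SalePrice ~ ., data = toy)
print(coef(leaky)[["SalePrice"]])

# Clean fit: naming the response by its column lets "." exclude it
clean <- lm(SalePrice ~ ., data = toy)
print(summary(clean)$r.squared)
```

Referring to the response by its column name (or removing it from the predictor data before fitting) keeps it out of the right-hand side of the formula.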
Call:
lm(formula = ameseda_n$SalePrice ~ ., data = .)
Residuals:
Min 1Q Median 3Q Max
-1.725e-09 -5.060e-11 4.900e-12 5.470e-11 1.198e-08
Coefficients: (50 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.850e-09 1.088e-09 1.700e+00 0.089365 .
`ameseda_n$SalePrice` 1.000e+00 3.283e-16 3.046e+15 < 2e-16 ***
ms_zoning_c_all -4.339e-11 1.543e-10 -2.810e-01 0.778670
ms_zoning_fv 2.621e-11 1.223e-10 2.140e-01 0.830385
ms_zoning_rh -1.298e-11 1.226e-10 -1.060e-01 0.915720
ms_zoning_rl -9.447e-12 6.043e-11 -1.560e-01 0.875789
ms_zoning_rm NA NA NA NA
street_grvl -3.317e-12 1.906e-10 -1.700e-02 0.986118
street_pave NA NA NA NA
alley_grvl -2.372e-11 9.685e-11 -2.450e-01 0.806554
alley_none -2.251e-11 7.612e-11 -2.960e-01 0.767470
alley_pave NA NA NA NA
lot_shape_ir1 -4.272e-11 2.601e-11 -1.643e+00 0.100736
lot_shape_ir2 -9.032e-11 6.930e-11 -1.303e+00 0.192668
lot_shape_ir3 -4.377e-11 1.399e-10 -3.130e-01 0.754519
lot_shape_reg NA NA NA NA
land_contour_bnk -6.947e-13 5.971e-11 -1.200e-02 0.990719
land_contour_hls -2.948e-11 6.525e-11 -4.520e-01 0.651514
land_contour_low -6.675e-11 8.968e-11 -7.440e-01 0.456865
land_contour_lvl NA NA NA NA
utilities_all_pub -1.786e-10 4.115e-10 -4.340e-01 0.664328
utilities_no_se_wa NA NA NA NA
lot_config_corner 6.613e-12 2.844e-11 2.330e-01 0.816173
lot_config_cul_d_sac -6.085e-13 4.814e-11 -1.300e-02 0.989918
lot_config_fr2 2.222e-10 6.069e-11 3.661e+00 0.000262 ***
lot_config_fr3 3.949e-11 2.028e-10 1.950e-01 0.845653
lot_config_inside NA NA NA NA
land_slope_gtl -4.967e-11 1.521e-10 -3.270e-01 0.744063
land_slope_mod -8.144e-11 1.517e-10 -5.370e-01 0.591405
land_slope_sev NA NA NA NA
neighborhood_blmngtn -1.316e-09 1.664e-10 -7.907e+00 5.79e-15 ***
neighborhood_blueste -1.477e-09 3.165e-10 -4.667e+00 3.39e-06 ***
neighborhood_br_dale -1.283e-09 1.870e-10 -6.859e+00 1.09e-11 ***
neighborhood_brk_side -1.407e-09 1.506e-10 -9.342e+00 < 2e-16 ***
neighborhood_clear_cr -1.349e-09 1.525e-10 -8.847e+00 < 2e-16 ***
neighborhood_collg_cr -1.394e-09 1.329e-10 -1.048e+01 < 2e-16 ***
neighborhood_crawfor -1.344e-09 1.424e-10 -9.441e+00 < 2e-16 ***
neighborhood_edwards -1.409e-09 1.376e-10 -1.024e+01 < 2e-16 ***
neighborhood_gilbert -1.385e-09 1.386e-10 -9.993e+00 < 2e-16 ***
neighborhood_idotrr -1.387e-09 1.693e-10 -8.194e+00 6.23e-16 ***
neighborhood_meadow_v -1.347e-09 1.862e-10 -7.233e+00 8.24e-13 ***
neighborhood_mitchel -1.390e-09 1.417e-10 -9.809e+00 < 2e-16 ***
neighborhood_n_ames -1.424e-09 1.322e-10 -1.077e+01 < 2e-16 ***
neighborhood_no_ridge -1.338e-09 1.444e-10 -9.263e+00 < 2e-16 ***
neighborhood_n_pk_vill -1.302e-09 2.326e-10 -5.599e+00 2.66e-08 ***
neighborhood_nridg_ht -1.345e-09 1.406e-10 -9.564e+00 < 2e-16 ***
neighborhood_nw_ames -1.407e-09 1.356e-10 -1.038e+01 < 2e-16 ***
neighborhood_old_town -1.388e-09 1.524e-10 -9.113e+00 < 2e-16 ***
neighborhood_sawyer -1.401e-09 1.373e-10 -1.021e+01 < 2e-16 ***
neighborhood_sawyer_w -1.348e-09 1.366e-10 -9.871e+00 < 2e-16 ***
neighborhood_somerst -1.385e-09 1.575e-10 -8.791e+00 < 2e-16 ***
neighborhood_stone_br -1.272e-09 1.533e-10 -8.299e+00 2.71e-16 ***
neighborhood_swisu -1.399e-09 1.597e-10 -8.761e+00 < 2e-16 ***
neighborhood_timber -1.391e-09 1.447e-10 -9.613e+00 < 2e-16 ***
neighborhood_veenker NA NA NA NA
condition1_artery 6.544e-11 2.081e-10 3.140e-01 0.753205
condition1_feedr 2.101e-10 2.012e-10 1.044e+00 0.296637
condition1_norm 6.196e-11 1.964e-10 3.150e-01 0.752446
condition1_pos_a 6.213e-11 2.442e-10 2.540e-01 0.799208
condition1_pos_n 1.303e-10 2.188e-10 5.960e-01 0.551612
condition1_rr_ae 4.059e-11 2.362e-10 1.720e-01 0.863587
condition1_rr_an 5.550e-11 2.062e-10 2.690e-01 0.787868
condition1_rr_ne 1.127e-10 3.367e-10 3.350e-01 0.737971
condition1_rr_nn NA NA NA NA
condition2_artery 1.607e-10 4.332e-10 3.710e-01 0.710750
condition2_feedr 2.423e-10 3.403e-10 7.120e-01 0.476644
condition2_norm 1.583e-10 2.877e-10 5.500e-01 0.582287
condition2_pos_a 3.717e-10 5.900e-10 6.300e-01 0.528804
condition2_pos_n 6.521e-11 4.153e-10 1.570e-01 0.875258
condition2_rr_ae 1.349e-11 7.812e-10 1.700e-02 0.986227
condition2_rr_an 6.096e-11 4.793e-10 1.270e-01 0.898819
condition2_rr_nn NA NA NA NA
bldg_type_1fam 1.025e-10 5.882e-11 1.742e+00 0.081682 .
bldg_type_2fm_con 6.272e-11 1.017e-10 6.160e-01 0.537681
bldg_type_duplex 9.062e-11 9.228e-11 9.820e-01 0.326247
bldg_type_twnhs 6.505e-12 8.128e-11 8.000e-02 0.936231
bldg_type_twnhs_e NA NA NA NA
house_style_1_5fin 1.097e-10 7.102e-11 1.544e+00 0.122812
house_style_1_5unf 8.286e-11 1.337e-10 6.200e-01 0.535420
house_style_1story 1.289e-10 5.963e-11 2.161e+00 0.030898 *
house_style_2_5fin 1.186e-10 1.698e-10 6.990e-01 0.484855
house_style_2_5unf 6.740e-11 1.513e-10 4.450e-01 0.656171
house_style_2story 1.030e-10 6.167e-11 1.670e+00 0.095200 .
house_style_s_foyer 5.239e-11 8.696e-11 6.020e-01 0.546979
house_style_s_lvl NA NA NA NA
roof_style_flat -1.234e-10 5.512e-10 -2.240e-01 0.822858
roof_style_gable -1.397e-11 4.725e-10 -3.000e-02 0.976424
roof_style_gambrel -2.528e-12 4.869e-10 -5.000e-03 0.995859
roof_style_hip -3.701e-11 4.727e-10 -7.800e-02 0.937606
roof_style_mansard 8.283e-11 4.682e-10 1.770e-01 0.859597
roof_style_shed NA NA NA NA
roof_matl_cly_tile -2.269e-10 5.729e-10 -3.960e-01 0.692175
roof_matl_comp_shg 2.432e-10 1.838e-10 1.323e+00 0.185935
roof_matl_membran 2.664e-10 5.535e-10 4.810e-01 0.630445
roof_matl_metal 3.106e-10 5.294e-10 5.870e-01 0.557604
roof_matl_roll 1.357e-10 4.524e-10 3.000e-01 0.764319
roof_matl_tar_grv 4.095e-10 3.425e-10 1.195e+00 0.232167
roof_matl_wd_shake 2.492e-10 3.000e-10 8.310e-01 0.406373
roof_matl_wd_shngl NA NA NA NA
exterior1st_asb_shng 1.321e-10 2.141e-10 6.170e-01 0.537418
exterior1st_asph_shn 1.447e-10 5.128e-10 2.820e-01 0.777812
exterior1st_brk_comm 5.179e-11 4.169e-10 1.240e-01 0.901162
exterior1st_brk_face 6.385e-11 1.253e-10 5.090e-01 0.610555
exterior1st_c_block -4.819e-12 4.329e-10 -1.100e-02 0.991120
exterior1st_cemnt_bd 1.359e-11 2.516e-10 5.400e-02 0.956940
exterior1st_hd_board 2.082e-11 1.157e-10 1.800e-01 0.857309
exterior1st_im_stucc 2.368e-11 4.264e-10 5.600e-02 0.955727
exterior1st_metal_sd 5.928e-12 1.600e-10 3.700e-02 0.970453
exterior1st_plywood -5.699e-11 1.161e-10 -4.910e-01 0.623485
exterior1st_stone 1.023e-10 3.409e-10 3.000e-01 0.764215
exterior1st_stucco 8.690e-11 1.573e-10 5.530e-01 0.580654
exterior1st_vinyl_sd 9.400e-11 1.426e-10 6.590e-01 0.509769
exterior1st_wd_sdng 3.748e-11 1.071e-10 3.500e-01 0.726564
exterior1st_wd_shing NA NA NA NA
exterior2nd_asb_shng -5.986e-11 2.007e-10 -2.980e-01 0.765560
exterior2nd_asph_shn -6.496e-11 3.132e-10 -2.070e-01 0.835744
exterior2nd_brk_cmn -3.677e-11 2.775e-10 -1.330e-01 0.894608
exterior2nd_brk_face -1.284e-10 1.373e-10 -9.350e-01 0.350029
exterior2nd_c_block NA NA NA NA
exterior2nd_cment_bd 3.845e-11 2.442e-10 1.570e-01 0.874911
exterior2nd_hd_board 1.383e-11 1.084e-10 1.280e-01 0.898455
exterior2nd_im_stucc 4.810e-12 1.631e-10 3.000e-02 0.976470
exterior2nd_metal_sd 1.359e-10 1.542e-10 8.810e-01 0.378340
exterior2nd_other -2.444e-10 4.106e-10 -5.950e-01 0.551744
exterior2nd_plywood -2.470e-11 1.030e-10 -2.400e-01 0.810514
exterior2nd_stone -3.131e-11 2.167e-10 -1.440e-01 0.885147
exterior2nd_stucco 1.304e-11 1.493e-10 8.700e-02 0.930394
exterior2nd_vinyl_sd -4.176e-11 1.268e-10 -3.290e-01 0.741953
exterior2nd_wd_sdng 1.550e-11 9.321e-11 1.660e-01 0.867991
exterior2nd_wd_shng NA NA NA NA
mas_vnr_type_brk_cmn -2.031e-11 1.163e-10 -1.750e-01 0.861403
mas_vnr_type_brk_face -1.044e-11 4.595e-11 -2.270e-01 0.820356
mas_vnr_type_none -2.179e-12 4.747e-11 -4.600e-02 0.963388
mas_vnr_type_stone NA NA NA NA
exter_qual_ex -3.434e-11 8.547e-11 -4.020e-01 0.687945
exter_qual_fa 1.554e-11 1.529e-10 1.020e-01 0.919036
exter_qual_gd -7.770e-11 3.926e-11 -1.979e+00 0.048027 *
exter_qual_ta NA NA NA NA
exter_cond_ex -3.469e-11 2.765e-10 -1.250e-01 0.900207
exter_cond_fa -2.707e-11 9.296e-11 -2.910e-01 0.770945
exter_cond_gd -5.852e-11 3.778e-11 -1.549e+00 0.121601
exter_cond_po 7.901e-11 4.224e-10 1.870e-01 0.851667
exter_cond_ta NA NA NA NA
foundation_brk_til -5.724e-12 2.344e-10 -2.400e-02 0.980520
foundation_c_block 2.799e-11 2.314e-10 1.210e-01 0.903760
foundation_p_conc -3.838e-11 2.300e-10 -1.670e-01 0.867498
foundation_slab 1.128e-11 2.772e-10 4.100e-02 0.967558
foundation_stone 2.348e-11 2.893e-10 8.100e-02 0.935319
foundation_wood NA NA NA NA
bsmt_qual_ex 3.303e-11 6.540e-11 5.050e-01 0.613580
bsmt_qual_fa -1.630e-11 7.812e-11 -2.090e-01 0.834730
bsmt_qual_gd 5.814e-11 3.957e-11 1.469e+00 0.142013
bsmt_qual_none -4.905e-11 5.484e-10 -8.900e-02 0.928756
bsmt_qual_ta NA NA NA NA
bsmt_cond_fa -4.492e-12 6.786e-11 -6.600e-02 0.947236
bsmt_cond_gd 2.126e-11 5.195e-11 4.090e-01 0.682375
bsmt_cond_none NA NA NA NA
bsmt_cond_po -1.228e-10 4.705e-10 -2.610e-01 0.794076
bsmt_cond_ta NA NA NA NA
bsmt_exposure_av 2.244e-11 3.761e-10 6.000e-02 0.952426
bsmt_exposure_gd 1.283e-10 3.780e-10 3.400e-01 0.734264
bsmt_exposure_mn 1.968e-12 3.771e-10 5.000e-03 0.995837
bsmt_exposure_no -2.369e-11 3.753e-10 -6.300e-02 0.949691
bsmt_exposure_none NA NA NA NA
bsmt_fin_type1_alq 5.641e-11 3.874e-11 1.456e+00 0.145583
bsmt_fin_type1_blq -3.629e-12 4.458e-11 -8.100e-02 0.935126
bsmt_fin_type1_glq -4.243e-12 3.262e-11 -1.300e-01 0.896518
bsmt_fin_type1_lw_q -1.804e-11 5.601e-11 -3.220e-01 0.747491
bsmt_fin_type1_none NA NA NA NA
bsmt_fin_type1_rec -6.604e-12 4.577e-11 -1.440e-01 0.885293
bsmt_fin_type1_unf NA NA NA NA
bsmt_fin_type2_alq -7.360e-11 9.835e-11 -7.480e-01 0.454387
bsmt_fin_type2_blq -1.253e-11 7.170e-11 -1.750e-01 0.861244
bsmt_fin_type2_glq -9.330e-11 1.219e-10 -7.660e-01 0.444033
bsmt_fin_type2_lw_q -2.092e-11 6.253e-11 -3.350e-01 0.738042
bsmt_fin_type2_none 3.171e-11 3.784e-10 8.400e-02 0.933231
bsmt_fin_type2_rec -2.491e-11 6.029e-11 -4.130e-01 0.679564
bsmt_fin_type2_unf NA NA NA NA
heating_floor -1.697e-10 4.837e-10 -3.510e-01 0.725826
heating_gas_a -3.301e-11 2.331e-10 -1.420e-01 0.887394
heating_gas_w -4.190e-11 2.488e-10 -1.680e-01 0.866281
heating_grav -5.942e-11 2.848e-10 -2.090e-01 0.834746
heating_oth_w -6.593e-11 3.742e-10 -1.760e-01 0.860160
heating_wall NA NA NA NA
heating_qc_ex 5.609e-11 3.325e-11 1.687e+00 0.091905 .
heating_qc_fa 2.061e-11 7.280e-11 2.830e-01 0.777131
heating_qc_gd 1.467e-11 3.478e-11 4.220e-01 0.673200
heating_qc_po -1.180e-10 4.289e-10 -2.750e-01 0.783299
heating_qc_ta NA NA NA NA
central_air_n 2.728e-11 6.183e-11 4.410e-01 0.659069
central_air_y NA NA NA NA
electrical_fuse_a -1.437e-11 4.713e-11 -3.050e-01 0.760439
electrical_fuse_f -8.448e-12 8.895e-11 -9.500e-02 0.924351
electrical_fuse_p 6.548e-11 2.976e-10 2.200e-01 0.825869
electrical_mix 6.625e-11 7.170e-10 9.200e-02 0.926394
electrical_s_brkr NA NA NA NA
kitchen_qual_ex -2.163e-11 6.362e-11 -3.400e-01 0.734001
kitchen_qual_fa -1.073e-11 7.895e-11 -1.360e-01 0.891874
kitchen_qual_gd -1.759e-11 3.298e-11 -5.340e-01 0.593749
kitchen_qual_ta NA NA NA NA
functional_maj1 8.874e-11 1.166e-10 7.610e-01 0.446707
functional_maj2 -2.987e-12 1.990e-10 -1.500e-02 0.988024
functional_min1 -1.940e-11 7.511e-11 -2.580e-01 0.796263
functional_min2 -9.135e-12 7.236e-11 -1.260e-01 0.899566
functional_mod 3.934e-11 1.210e-10 3.250e-01 0.745139
functional_sev 2.664e-10 4.629e-10 5.760e-01 0.565034
functional_typ NA NA NA NA
fireplace_qu_ex -3.732e-12 8.915e-11 -4.200e-02 0.966611
fireplace_qu_fa -8.307e-11 7.470e-11 -1.112e+00 0.266367
fireplace_qu_gd -5.794e-11 3.562e-11 -1.626e+00 0.104133
fireplace_qu_none -5.938e-11 3.409e-11 -1.742e+00 0.081807 .
fireplace_qu_po -8.932e-11 9.638e-11 -9.270e-01 0.354213
fireplace_qu_ta NA NA NA NA
garage_type_2types -3.122e-11 1.793e-10 -1.740e-01 0.861809
garage_type_attchd 3.126e-11 6.017e-11 5.200e-01 0.603503
garage_type_basment 2.547e-11 1.143e-10 2.230e-01 0.823633
garage_type_built_in 5.996e-11 7.697e-11 7.790e-01 0.436070
garage_type_car_port 1.972e-11 1.635e-10 1.210e-01 0.904035
garage_type_detchd 2.039e-11 5.647e-11 3.610e-01 0.718037
garage_type_none NA NA NA NA
garage_finish_fin 1.373e-11 3.860e-11 3.560e-01 0.722227
garage_finish_none NA NA NA NA
garage_finish_r_fn 4.864e-11 3.428e-11 1.419e+00 0.156167
garage_finish_unf NA NA NA NA
garage_qual_ex 4.720e-10 4.709e-10 1.002e+00 0.316358
garage_qual_fa -1.724e-11 7.646e-11 -2.250e-01 0.821677
garage_qual_gd -1.803e-12 1.226e-10 -1.500e-02 0.988271
garage_qual_none NA NA NA NA
garage_qual_po -6.771e-11 3.866e-10 -1.750e-01 0.860997
garage_qual_ta NA NA NA NA
garage_cond_ex -5.060e-10 5.461e-10 -9.270e-01 0.354315
garage_cond_fa 5.893e-12 8.769e-11 6.700e-02 0.946435
garage_cond_gd 2.317e-11 1.481e-10 1.560e-01 0.875708
garage_cond_none NA NA NA NA
garage_cond_po 6.234e-11 2.219e-10 2.810e-01 0.778789
garage_cond_ta NA NA NA NA
paved_drive_n -9.486e-12 5.481e-11 -1.730e-01 0.862643
paved_drive_p 2.423e-11 7.868e-11 3.080e-01 0.758140
paved_drive_y NA NA NA NA
pool_qc_ex 2.100e-10 3.049e-10 6.890e-01 0.491122
pool_qc_fa -1.596e-11 3.995e-10 -4.000e-02 0.968142
pool_qc_gd 2.270e-10 3.087e-10 7.360e-01 0.462160
pool_qc_none NA NA NA NA
fence_gd_prv 1.886e-11 5.916e-11 3.190e-01 0.749966
fence_gd_wo -1.718e-11 5.756e-11 -2.990e-01 0.765345
fence_mn_prv -1.154e-11 3.673e-11 -3.140e-01 0.753495
fence_mn_ww 1.213e-11 1.205e-10 1.010e-01 0.919800
fence_none NA NA NA NA
misc_feature_gar2 -1.646e-10 6.727e-10 -2.450e-01 0.806675
misc_feature_none -1.565e-10 5.548e-10 -2.820e-01 0.777853
misc_feature_othr -1.151e-10 6.331e-10 -1.820e-01 0.855793
misc_feature_shed -2.092e-10 5.591e-10 -3.740e-01 0.708278
misc_feature_ten_c NA NA NA NA
sale_type_cod 1.102e-11 6.734e-11 1.640e-01 0.870099
sale_type_con -6.292e-10 2.776e-10 -2.267e+00 0.023583 *
sale_type_con_ld -3.498e-11 1.442e-10 -2.430e-01 0.808303
sale_type_con_li -3.441e-11 1.762e-10 -1.950e-01 0.845216
sale_type_con_lw 1.441e-11 1.840e-10 7.800e-02 0.937592
sale_type_cwd 3.045e-11 1.972e-10 1.540e-01 0.877291
sale_type_new 2.398e-10 2.406e-10 9.970e-01 0.319075
sale_type_oth -4.789e-11 2.281e-10 -2.100e-01 0.833751
sale_type_wd NA NA NA NA
sale_condition_abnorml 2.246e-10 2.397e-10 9.370e-01 0.348864
sale_condition_adj_land 1.300e-10 3.240e-10 4.010e-01 0.688232
sale_condition_alloca 2.355e-10 2.721e-10 8.650e-01 0.386982
sale_condition_family 2.426e-10 2.526e-10 9.600e-01 0.337159
sale_condition_normal 2.262e-10 2.373e-10 9.530e-01 0.340707
sale_condition_partial NA NA NA NA
sale_price NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.712e-10 on 1241 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 3.065e+29 on 218 and 1241 DF, p-value: < 2.2e-16
After converting our categorical variables into dummy variables and fitting a model to assess their significance, we found that "Neighborhood" emerged as the most significant categorical predictor. This suggests that the neighborhood of a property plays a crucial role in determining sale price. Two caveats apply. First, the summary above reports an R-squared of 1 and a residual standard error of about 3.7e-10, and sale_price itself appears in the coefficient table. A numerically perfect fit like this usually means the response is among the predictors, so the individual p-values are best treated as a rough screen rather than as formal evidence.
Second, in a more complex model the apparent importance of "Neighborhood" may depend on its interactions with other variables. It would therefore be worthwhile to explore those interactions and evaluate their significance to get a fuller picture of how the predictors affect sale price.
Here we can see that all of the neighborhood variables are significant, suggesting that neighborhood is an important factor in determining sale price. Several other dummy variables also look important, including lot_config_fr2, house_style1story, exter_qual_gd, fireplace_qu_none, and sale_type_con.
Transformations
In the following chunk of code, we are generating new columns to capture transformed versions of different attributes. These transformations will be further analyzed in the subsequent sections.
# Create log-transformed and squared versions of selected variables
ames$log_sale_price <- log(ames$sale_price)
ames$log_gr_liv_area <- log(ames$gr_liv_area)
ames$overall_qual_2 <- ames$overall_qual^2
ames$lot_area_2 <- ames$lot_area^2
ames$log_lot_area <- log(ames$lot_area)
# ames$year_built_t = plogis(ames_non_dummy$year_built-1940)
# Note: log() returns -Inf wherever an area is 0 (e.g. no basement or no garage)
ames$log_total_bsmt_sf <- log(ames$total_bsmt_sf)
ames$log_garage_area <- log(ames$garage_area)
ames$log_x1st_flr_sf <- log(ames$x1st_flr_sf)
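One thing to watch with the area transformations is that a plain `log()` is undefined at zero, and columns such as basement or garage area may contain zeros for houses without those features. The sketch below uses a made-up vector (not the Ames data) to show the effect, and shows `log1p()` as one common alternative. `log1p()` is not what this analysis uses; it is included only for comparison.

```r
# Toy vector standing in for an area column such as garage_area,
# where 0 would mean "no garage" (hypothetical values, not from the dataset).
area <- c(0, 240, 480, 576)

log_area   <- log(area)    # first element is -Inf because log(0) = -Inf
log1p_area <- log1p(area)  # log(1 + x): maps 0 to 0 and stays finite

is.finite(log_area)
is.finite(log1p_area)

# Count how many rows a plain log() would turn into non-finite values
n_non_finite <- sum(!is.finite(log_area))
n_non_finite
```

Non-finite values produced this way are dropped by ggplot2 with a warning and have to be handled before fitting a model on the transformed columns.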
Sale Price Vs Gross Living Area by Neighborhood
# Plot Sale Price vs. Gross Living Area colored by neighborhood, omitting rows where SalePrice is NA
# Convert the dataframe from wide format to long format
ames_long <- ames %>%
  pivot_longer(
    cols = starts_with("neighborhood_"),
    names_to = "Neighborhood",
    values_to = "value"
  ) %>%
  filter(value == 1) %>% # Keep only rows where the neighborhood dummy variable is 1
  select(-value) # Remove the 'value' column as it's no longer needed
ames_long %>%
  filter(!is.na(sale_price)) %>%
  ggplot(aes(x = gr_liv_area, y = sale_price, color = Neighborhood)) +
  geom_point(show.legend = FALSE) +
  theme_gdocs() +
  labs(title = "Sale Price vs. Gross Living Area by Neighborhood", x = "Gross Living Area", y = "Sale Price")
The relationship between Sale Price and Gross Living Area is evident, and we can observe that neighborhoods also exhibit distinct relationships with Sale Price. This reaffirms the significance of the Neighborhood variable as observed in the previous linear model.
Sale Price Vs Gross Living Area (Untransformed Vs Transformed)
# Untransformed variables
# (par(mfrow = ...) only affects base graphics, so it is not used with ggplot2 here)
ames %>%
  ggplot(aes(x = gr_liv_area, y = sale_price)) +
  geom_point() +
  geom_smooth() +
  theme_gdocs() +
  labs(
    title = "Sale Price vs. Gross Living Area",
    x = "Gross Living Area",
    y = "Sale Price"
  )
# Log Transformed
ames %>%
  ggplot(aes(x = log_gr_liv_area, y = log_sale_price)) +
  geom_point() +
  geom_smooth() +
  theme_gdocs() +
  labs(
    title = "Log(Sale Price) vs. Log(Gross Living Area)",
    x = "Log(Gross Living Area)",
    y = "Log(Sale Price)"
  )
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning message: "Removed 1459 rows containing non-finite values (`stat_smooth()`)."
Warning message: "Removed 1459 rows containing missing values (`geom_point()`)."
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning message: "Removed 1459 rows containing non-finite values (`stat_smooth()`)."
Warning message: "Removed 1459 rows containing missing values (`geom_point()`)."
The log-transformed Sale Price and square footage exhibit a more linear relationship, indicating that utilizing these transformed variables will likely result in a more precise regression model. However, it is important to note that this enhanced accuracy comes at the cost of interpretability, as the transformed variables may be less intuitive to interpret directly.
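To make the interpretability trade-off concrete, here is a sketch of how a coefficient in a log-log model can be read. It assumes a simplified one-predictor model; the symbols $\beta_0$, $\beta_1$, and $\varepsilon$ are illustrative and are not estimates from this analysis.

```latex
\begin{aligned}
\log(\text{SalePrice}) &= \beta_0 + \beta_1 \log(\text{GrLivArea}) + \varepsilon \\
\Rightarrow\quad \text{SalePrice} &= e^{\beta_0}\,\text{GrLivArea}^{\beta_1}\,e^{\varepsilon} \\
\frac{\text{SalePrice at } 1.01 \cdot \text{GrLivArea}}{\text{SalePrice at GrLivArea}} &= 1.01^{\beta_1} \approx 1 + 0.01\,\beta_1
\end{aligned}
```

Under this form, a 1% increase in living area is associated with roughly a $\beta_1$% change in sale price, rather than a fixed dollar amount per square foot. This is why coefficients from the transformed model are less direct to explain than those from the untransformed one.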
EDA Conclusion
During our exploratory data analysis (EDA), we undertook several important steps. Firstly, we cleaned the data by addressing any inconsistencies or missing values. Next, we examined both numerical and categorical variables, exploring their distributions and identifying potential patterns or trends.
We also assessed the relationships between the explanatory variables and the response variable using visual and mathematical techniques. These analyses provided valuable insights into the dependencies and correlations within the dataset.
By conducting this comprehensive EDA, we have equipped ourselves with the necessary foundation for achieving our two primary objectives: developing an interpretable regression model and constructing a more complex predictive model. The insights gained from our EDA will guide us in selecting meaningful features and formulating effective modeling strategies.
Objective 1: Interpretable Regression Model
For this model, we will fit a linear regression with the variables that we have identified as significant. Because the focus of this model is interpretability, we will not include any interaction terms, polynomials, or other transformations.
Based on the exploratory analysis above, we will include the following variables in the regression model:
- gr_liv_area
- lot_area
- overall_qual
- year_built
- year_remod_add
- total_bsmt_sf
- garage_area
- garage_cars
- tot_rms_abv_grd
- all neighborhood dummy variables
- lot_config_fr2
- house_style1story
- exter_qual_gd
- fireplace_qu_none
- sale_type_con
# Split the data into training and testing sets
train <- ames %>%
  filter(train == 1) %>%
  select(-train)
test <- ames %>%
  filter(train == 0) %>%
  select(-train)
# Train a linear regression model with caret using CV
predictor_vars <- c(
  "gr_liv_area", "lot_area", "overall_qual", "year_built", "year_remod_add",
  "total_bsmt_sf", "garage_area", "garage_cars", "tot_rms_abv_grd", #"x1st_flr_sf", "x2nd_flr_sf", #removed for vif
  "lot_config_fr2", "house_style1story", "exter_qual_gd", "fireplace_qu_none", "sale_type_con"
) %>% paste(collapse = " + ")
neighborhood_vars <- grep("neighborhood", colnames(train), value = TRUE) %>% paste(collapse = " + ")
terms <- (paste(predictor_vars, neighborhood_vars, sep = " + ", collapse = " + "))
# Drop neighborhood_veenker so that Veenker acts as the reference neighborhood
formula <- as.formula(paste("sale_price ~", terms, "- neighborhood_veenker"))
set.seed(137)
ctrl <- trainControl(method = "cv", number = 10, verboseIter = TRUE)
lmFit <- train(formula, data = train, method = "lm", trControl = ctrl, metric = "RMSE")
summary(lmFit)
confint(lmFit$finalModel)
library(car)
vif(lmFit$finalModel)
# Plot the RMSE for each fold
lmFit$resample %>%
  ggplot(aes(x = (1:10), y = RMSE)) +
  geom_point() +
  geom_line() +
  theme_gdocs() +
  labs(title = "RMSE for each fold", x = "Fold", y = "RMSE")
+ Fold01: intercept=TRUE
- Fold01: intercept=TRUE
+ Fold02: intercept=TRUE
- Fold02: intercept=TRUE
+ Fold03: intercept=TRUE
- Fold03: intercept=TRUE
+ Fold04: intercept=TRUE
- Fold04: intercept=TRUE
+ Fold05: intercept=TRUE
- Fold05: intercept=TRUE
+ Fold06: intercept=TRUE
- Fold06: intercept=TRUE
+ Fold07: intercept=TRUE
- Fold07: intercept=TRUE
+ Fold08: intercept=TRUE
- Fold08: intercept=TRUE
+ Fold09: intercept=TRUE
- Fold09: intercept=TRUE
+ Fold10: intercept=TRUE
- Fold10: intercept=TRUE
Aggregating results
Fitting final model on full training set
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-133463 -14543 267 14083 229636
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.477e+06 1.462e+05 -10.098 < 2e-16 ***
gr_liv_area 6.197e+01 3.872e+00 16.006 < 2e-16 ***
lot_area 7.219e-01 9.298e-02 7.764 1.56e-14 ***
overall_qual 1.489e+04 1.074e+03 13.864 < 2e-16 ***
year_built 3.766e+02 6.327e+01 5.952 3.34e-09 ***
year_remod_add 3.633e+02 5.383e+01 6.749 2.17e-11 ***
total_bsmt_sf 2.940e+01 3.083e+00 9.537 < 2e-16 ***
garage_area 2.915e+01 8.774e+00 3.322 0.000915 ***
garage_cars 2.010e+03 2.601e+03 0.773 0.439841
tot_rms_abv_grd -1.963e+03 9.365e+02 -2.096 0.036268 *
lot_config_fr2 -6.850e+03 4.603e+03 -1.488 0.136901
house_style1story 4.185e+03 2.376e+03 1.762 0.078351 .
exter_qual_gd -1.492e+04 2.452e+03 -6.084 1.50e-09 ***
fireplace_qu_none -4.561e+03 1.989e+03 -2.293 0.021969 *
sale_type_con 2.274e+04 2.208e+04 1.030 0.303223
neighborhood_blmngtn -4.480e+04 1.218e+04 -3.677 0.000245 ***
neighborhood_blueste -5.290e+04 2.363e+04 -2.239 0.025297 *
neighborhood_br_dale -4.722e+04 1.254e+04 -3.767 0.000172 ***
neighborhood_brk_side -1.630e+04 1.081e+04 -1.508 0.131774
neighborhood_clear_cr -2.934e+04 1.126e+04 -2.605 0.009278 **
neighborhood_collg_cr -3.181e+04 9.895e+03 -3.215 0.001333 **
neighborhood_crawfor -4.416e+03 1.068e+04 -0.413 0.679304
neighborhood_edwards -2.995e+04 1.027e+04 -2.916 0.003606 **
neighborhood_gilbert -3.742e+04 1.027e+04 -3.644 0.000278 ***
neighborhood_idotrr -2.954e+04 1.133e+04 -2.608 0.009190 **
neighborhood_meadow_v -3.609e+04 1.240e+04 -2.909 0.003682 **
neighborhood_mitchel -4.235e+04 1.055e+04 -4.013 6.31e-05 ***
neighborhood_n_ames -3.303e+04 9.901e+03 -3.336 0.000873 ***
neighborhood_no_ridge 1.206e+04 1.076e+04 1.121 0.262531
neighborhood_n_pk_vill -4.532e+04 1.401e+04 -3.234 0.001247 **
neighborhood_nridg_ht 7.301e+03 1.029e+04 0.709 0.478178
neighborhood_nw_ames -4.601e+04 1.023e+04 -4.499 7.38e-06 ***
neighborhood_old_town -3.666e+04 1.059e+04 -3.461 0.000555 ***
neighborhood_sawyer -3.350e+04 1.034e+04 -3.241 0.001218 **
neighborhood_sawyer_w -3.583e+04 1.035e+04 -3.461 0.000555 ***
neighborhood_somerst -2.540e+04 1.013e+04 -2.506 0.012313 *
neighborhood_stone_br 1.859e+04 1.136e+04 1.637 0.101919
neighborhood_swisu -3.692e+04 1.187e+04 -3.109 0.001914 **
neighborhood_timber -2.916e+04 1.076e+04 -2.710 0.006805 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 30340 on 1419 degrees of freedom
Multiple R-squared: 0.8581, Adjusted R-squared: 0.8543
F-statistic: 225.8 on 38 and 1419 DF, p-value: < 2.2e-16
| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -1.763431e+06 | -1.189758e+06 |
| gr_liv_area | 5.437895e+01 | 6.956926e+01 |
| lot_area | 5.395554e-01 | 9.043484e-01 |
| overall_qual | 1.278191e+04 | 1.699522e+04 |
| year_built | 2.524593e+02 | 5.006983e+02 |
| year_remod_add | 2.576682e+02 | 4.688461e+02 |
| total_bsmt_sf | 2.335389e+01 | 3.544940e+01 |
| garage_area | 1.193809e+01 | 4.635941e+01 |
| garage_cars | -3.092948e+03 | 7.113031e+03 |
| tot_rms_abv_grd | -3.800020e+03 | -1.257428e+02 |
| lot_config_fr2 | -1.587970e+04 | 2.178861e+03 |
| house_style1story | -4.752613e+02 | 8.846175e+03 |
| exter_qual_gd | -1.973187e+04 | -1.011022e+04 |
| fireplace_qu_none | -8.462397e+03 | -6.598585e+02 |
| sale_type_con | -2.057103e+04 | 6.604946e+04 |
| neighborhood_blmngtn | -6.870347e+04 | -2.090313e+04 |
| neighborhood_blueste | -9.924745e+04 | -6.557713e+03 |
| neighborhood_br_dale | -7.180779e+04 | -2.262764e+04 |
| neighborhood_brk_side | -3.749566e+04 | 4.902204e+03 |
| neighborhood_clear_cr | -5.142601e+04 | -7.247184e+03 |
| neighborhood_collg_cr | -5.122557e+04 | -1.240342e+04 |
| neighborhood_crawfor | -2.536286e+04 | 1.653178e+04 |
| neighborhood_edwards | -5.010621e+04 | -9.800132e+03 |
| neighborhood_gilbert | -5.756889e+04 | -1.727781e+04 |
| neighborhood_idotrr | -5.176178e+04 | -7.326374e+03 |
| neighborhood_meadow_v | -6.042013e+04 | -1.175223e+04 |
| neighborhood_mitchel | -6.305431e+04 | -2.164816e+04 |
| neighborhood_n_ames | -5.244687e+04 | -1.360404e+04 |
| neighborhood_no_ridge | -9.046799e+03 | 3.316849e+04 |
| neighborhood_n_pk_vill | -7.281350e+04 | -1.783595e+04 |
| neighborhood_nridg_ht | -1.288705e+04 | 2.748901e+04 |
| neighborhood_nw_ames | -6.607272e+04 | -2.595021e+04 |
| neighborhood_old_town | -5.743547e+04 | -1.587850e+04 |
| neighborhood_sawyer | -5.377424e+04 | -1.322397e+04 |
| neighborhood_sawyer_w | -5.614498e+04 | -1.552221e+04 |
| neighborhood_somerst | -4.527509e+04 | -5.518848e+03 |
| neighborhood_stone_br | -3.690534e+03 | 4.086611e+04 |
| neighborhood_swisu | -6.021113e+04 | -1.362483e+04 |
| neighborhood_timber | -5.025784e+04 | -8.052844e+03 |
| Term | VIF |
|---|---|
| gr_liv_area | 6.11969596779814 |
| lot_area | 1.33000515946677 |
| overall_qual | 3.45775522115555 |
| year_built | 5.77631976937239 |
| year_remod_add | 1.9537417499232 |
| total_bsmt_sf | 2.5905061543776 |
| garage_area | 5.48761424547523 |
| garage_cars | 5.97790450731877 |
| tot_rms_abv_grd | 3.62442415580001 |
| lot_config_fr2 | 1.04677009206856 |
| house_style1story | 2.23497852236574 |
| exter_qual_gd | 2.12105019203486 |
| fireplace_qu_none | 1.56150065862576 |
| sale_type_con | 1.05752986809444 |
| neighborhood_blmngtn | 2.70916191149184 |
| neighborhood_blueste | 1.2109178183108 |
| neighborhood_br_dale | 2.70100266329667 |
| neighborhood_brk_side | 7.06486582435775 |
| neighborhood_clear_cr | 3.78252961365224 |
| neighborhood_collg_cr | 14.3125822413063 |
| neighborhood_crawfor | 6.09594664621172 |
| neighborhood_edwards | 10.4801137038149 |
| neighborhood_gilbert | 8.55991313432559 |
| neighborhood_idotrr | 5.02474712699903 |
| neighborhood_meadow_v | 2.8083936439461 |
| neighborhood_mitchel | 5.72923615321761 |
| neighborhood_n_ames | 20.2594427028515 |
| neighborhood_no_ridge | 5.01133360275701 |
| neighborhood_n_pk_vill | 1.90783894628335 |
| neighborhood_nridg_ht | 8.39058585283584 |
| neighborhood_nw_ames | 7.87787061297265 |
| neighborhood_old_town | 12.7042370817657 |
| neighborhood_sawyer | 8.15108214729673 |
| neighborhood_sawyer_w | 6.59278330446856 |
| neighborhood_somerst | 9.02658161449665 |
| neighborhood_stone_br | 3.44247827334504 |
| neighborhood_swisu | 3.76324729504303 |
| neighborhood_timber | 4.65221482319534 |
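The values listed above are the variance inflation factors (VIFs) returned by `vif()`. A common rule of thumb treats values above about 10 as a sign of strong collinearity; here that applies to a few neighborhood dummies (N Ames, CollgCr, Old Town, Edwards), which often happens when dummies from one factor are measured against a small reference level. The sketch below shows the definition of the VIF by computing it by hand. It uses the built-in `mtcars` data as a stand-in for the Ames predictors, and the variable choice is purely illustrative.

```r
# Built-in mtcars data stands in for the Ames predictors (illustrative only).
predictors <- mtcars[, c("disp", "hp", "wt", "drat")]

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all of the other predictors.
manual_vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux_fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux_fit)$r.squared)
})

round(manual_vif, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values mean its coefficient's variance is inflated by that factor relative to the uncorrelated case.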
Summary table of coefficients.
# Summary table of coefficients
# Create a tidy data frame from the model and round the numbers
tidy_fit <- lmFit$finalModel %>%
  broom::tidy() %>%
  mutate(across(where(is.numeric), ~ round(., 4)))
# Create a table with bolded rows for p-value < 0.05
table <- tidy_fit %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE) %>%
  row_spec(which(tidy_fit$p.value < 0.05), bold = TRUE)
table
<table class="table table-striped table-hover table-condensed table-responsive" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> (Intercept) </td> <td style="text-align:right;font-weight: bold;"> -1476594.2891 </td> <td style="text-align:right;font-weight: bold;"> 146222.8999 </td> <td style="text-align:right;font-weight: bold;"> -10.0982 </td> <td style="text-align:right;font-weight: bold;"> 0.0000 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> gr_liv_area </td> <td style="text-align:right;font-weight: bold;"> 61.9741 </td> <td style="text-align:right;font-weight: bold;"> 3.8718 </td> <td style="text-align:right;font-weight: bold;"> 16.0064 </td> <td style="text-align:right;font-weight: bold;"> 0.0000 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> lot_area </td> <td style="text-align:right;font-weight: bold;"> 0.7220 </td> <td style="text-align:right;font-weight: bold;"> 0.0930 </td> <td style="text-align:right;font-weight: bold;"> 7.7644 </td> <td style="text-align:right;font-weight: bold;"> 0.0000 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> overall_qual </td> <td style="text-align:right;font-weight: bold;"> 14888.5629 </td> <td style="text-align:right;font-weight: bold;"> 1073.9280 </td> <td style="text-align:right;font-weight: bold;"> 13.8637 </td> <td style="text-align:right;font-weight: bold;"> 0.0000 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> year_built </td> <td style="text-align:right;font-weight: bold;"> 376.5788 </td> <td style="text-align:right;font-weight: bold;"> 63.2734 </td> <td style="text-align:right;font-weight: bold;"> 5.9516 </td> <td 
style="text-align:right;font-weight: bold;"> 0.0000 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> year_remod_add </td> <td style="text-align:right;font-weight: bold;"> 363.2571 </td> <td style="text-align:right;font-weight: bold;"> 53.8269 </td> <td style="text-align:right;font-weight: bold;"> 6.7486 </td> <td style="text-align:right;font-weight: bold;"> 0.0000 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> total_bsmt_sf </td> <td style="text-align:right;font-weight: bold;"> 29.4016 </td> <td style="text-align:right;font-weight: bold;"> 3.0830 </td> <td style="text-align:right;font-weight: bold;"> 9.5367 </td> <td style="text-align:right;font-weight: bold;"> 0.0000 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> garage_area </td> <td style="text-align:right;font-weight: bold;"> 29.1488 </td> <td style="text-align:right;font-weight: bold;"> 8.7736 </td> <td style="text-align:right;font-weight: bold;"> 3.3223 </td> <td style="text-align:right;font-weight: bold;"> 0.0009 </td> </tr> <tr> <td style="text-align:left;"> garage_cars </td> <td style="text-align:right;"> 2010.0417 </td> <td style="text-align:right;"> 2601.3932 </td> <td style="text-align:right;"> 0.7727 </td> <td style="text-align:right;"> 0.4398 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> tot_rms_abv_grd </td> <td style="text-align:right;font-weight: bold;"> -1962.8816 </td> <td style="text-align:right;font-weight: bold;"> 936.5334 </td> <td style="text-align:right;font-weight: bold;"> -2.0959 </td> <td style="text-align:right;font-weight: bold;"> 0.0363 </td> </tr> <tr> <td style="text-align:left;"> lot_config_fr2 </td> <td style="text-align:right;"> -6850.4208 </td> <td style="text-align:right;"> 4602.9317 </td> <td style="text-align:right;"> -1.4883 </td> <td style="text-align:right;"> 0.1369 </td> </tr> <tr> <td style="text-align:left;"> house_style1story </td> <td style="text-align:right;"> 4185.4567 </td> <td 
style="text-align:right;"> 2375.9327 </td> <td style="text-align:right;"> 1.7616 </td> <td style="text-align:right;"> 0.0784 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> exter_qual_gd </td> <td style="text-align:right;font-weight: bold;"> -14921.0417 </td> <td style="text-align:right;font-weight: bold;"> 2452.4548 </td> <td style="text-align:right;font-weight: bold;"> -6.0841 </td> <td style="text-align:right;font-weight: bold;"> 0.0000 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> fireplace_qu_none </td> <td style="text-align:right;font-weight: bold;"> -4561.1277 </td> <td style="text-align:right;font-weight: bold;"> 1988.7823 </td> <td style="text-align:right;font-weight: bold;"> -2.2934 </td> <td style="text-align:right;font-weight: bold;"> 0.0220 </td> </tr> <tr> <td style="text-align:left;"> sale_type_con </td> <td style="text-align:right;"> 22739.2140 </td> <td style="text-align:right;"> 22078.6211 </td> <td style="text-align:right;"> 1.0299 </td> <td style="text-align:right;"> 0.3032 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_blmngtn </td> <td style="text-align:right;font-weight: bold;"> -44803.3000 </td> <td style="text-align:right;font-weight: bold;"> 12183.7882 </td> <td style="text-align:right;font-weight: bold;"> -3.6773 </td> <td style="text-align:right;font-weight: bold;"> 0.0002 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_blueste </td> <td style="text-align:right;font-weight: bold;"> -52902.5804 </td> <td style="text-align:right;font-weight: bold;"> 23625.6062 </td> <td style="text-align:right;font-weight: bold;"> -2.2392 </td> <td style="text-align:right;font-weight: bold;"> 0.0253 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_br_dale </td> <td style="text-align:right;font-weight: bold;"> -47217.7155 </td> <td style="text-align:right;font-weight: bold;"> 12535.4866 </td> <td style="text-align:right;font-weight: 
bold;"> -3.7667 </td> <td style="text-align:right;font-weight: bold;"> 0.0002 </td> </tr> <tr> <td style="text-align:left;"> neighborhood_brk_side </td> <td style="text-align:right;"> -16296.7257 </td> <td style="text-align:right;"> 10806.7538 </td> <td style="text-align:right;"> -1.5080 </td> <td style="text-align:right;"> 0.1318 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_clear_cr </td> <td style="text-align:right;font-weight: bold;"> -29336.5949 </td> <td style="text-align:right;font-weight: bold;"> 11260.7015 </td> <td style="text-align:right;font-weight: bold;"> -2.6052 </td> <td style="text-align:right;font-weight: bold;"> 0.0093 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_collg_cr </td> <td style="text-align:right;font-weight: bold;"> -31814.4945 </td> <td style="text-align:right;font-weight: bold;"> 9895.3420 </td> <td style="text-align:right;font-weight: bold;"> -3.2151 </td> <td style="text-align:right;font-weight: bold;"> 0.0013 </td> </tr> <tr> <td style="text-align:left;"> neighborhood_crawfor </td> <td style="text-align:right;"> -4415.5412 </td> <td style="text-align:right;"> 10678.4877 </td> <td style="text-align:right;"> -0.4135 </td> <td style="text-align:right;"> 0.6793 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_edwards </td> <td style="text-align:right;font-weight: bold;"> -29953.1689 </td> <td style="text-align:right;font-weight: bold;"> 10273.5802 </td> <td style="text-align:right;font-weight: bold;"> -2.9156 </td> <td style="text-align:right;font-weight: bold;"> 0.0036 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_gilbert </td> <td style="text-align:right;font-weight: bold;"> -37423.3500 </td> <td style="text-align:right;font-weight: bold;"> 10269.7604 </td> <td style="text-align:right;font-weight: bold;"> -3.6440 </td> <td style="text-align:right;font-weight: bold;"> 0.0003 </td> </tr> <tr> <td 
style="text-align:left;font-weight: bold;"> neighborhood_idotrr </td> <td style="text-align:right;font-weight: bold;"> -29544.0795 </td> <td style="text-align:right;font-weight: bold;"> 11326.1032 </td> <td style="text-align:right;font-weight: bold;"> -2.6085 </td> <td style="text-align:right;font-weight: bold;"> 0.0092 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_meadow_v </td> <td style="text-align:right;font-weight: bold;"> -36086.1806 </td> <td style="text-align:right;font-weight: bold;"> 12404.9167 </td> <td style="text-align:right;font-weight: bold;"> -2.9090 </td> <td style="text-align:right;font-weight: bold;"> 0.0037 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_mitchel </td> <td style="text-align:right;font-weight: bold;"> -42351.2343 </td> <td style="text-align:right;font-weight: bold;"> 10553.9768 </td> <td style="text-align:right;font-weight: bold;"> -4.0128 </td> <td style="text-align:right;font-weight: bold;"> 0.0001 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_n_ames </td> <td style="text-align:right;font-weight: bold;"> -33025.4550 </td> <td style="text-align:right;font-weight: bold;"> 9900.6170 </td> <td style="text-align:right;font-weight: bold;"> -3.3357 </td> <td style="text-align:right;font-weight: bold;"> 0.0009 </td> </tr> <tr> <td style="text-align:left;"> neighborhood_no_ridge </td> <td style="text-align:right;"> 12060.8469 </td> <td style="text-align:right;"> 10760.2191 </td> <td style="text-align:right;"> 1.1209 </td> <td style="text-align:right;"> 0.2625 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_n_pk_vill </td> <td style="text-align:right;font-weight: bold;"> -45324.7249 </td> <td style="text-align:right;font-weight: bold;"> 14013.1806 </td> <td style="text-align:right;font-weight: bold;"> -3.2344 </td> <td style="text-align:right;font-weight: bold;"> 0.0012 </td> </tr> <tr> <td style="text-align:left;"> 
neighborhood_nridg_ht </td> <td style="text-align:right;"> 7300.9826 </td> <td style="text-align:right;"> 10291.4188 </td> <td style="text-align:right;"> 0.7094 </td> <td style="text-align:right;"> 0.4782 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_nw_ames </td> <td style="text-align:right;font-weight: bold;"> -46011.4606 </td> <td style="text-align:right;font-weight: bold;"> 10226.7919 </td> <td style="text-align:right;font-weight: bold;"> -4.4991 </td> <td style="text-align:right;font-weight: bold;"> 0.0000 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_old_town </td> <td style="text-align:right;font-weight: bold;"> -36656.9849 </td> <td style="text-align:right;font-weight: bold;"> 10592.4212 </td> <td style="text-align:right;font-weight: bold;"> -3.4607 </td> <td style="text-align:right;font-weight: bold;"> 0.0006 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_sawyer </td> <td style="text-align:right;font-weight: bold;"> -33499.1080 </td> <td style="text-align:right;font-weight: bold;"> 10335.8225 </td> <td style="text-align:right;font-weight: bold;"> -3.2411 </td> <td style="text-align:right;font-weight: bold;"> 0.0012 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_sawyer_w </td> <td style="text-align:right;font-weight: bold;"> -35833.5952 </td> <td style="text-align:right;font-weight: bold;"> 10354.3025 </td> <td style="text-align:right;font-weight: bold;"> -3.4607 </td> <td style="text-align:right;font-weight: bold;"> 0.0006 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_somerst </td> <td style="text-align:right;font-weight: bold;"> -25396.9710 </td> <td style="text-align:right;font-weight: bold;"> 10133.4350 </td> <td style="text-align:right;font-weight: bold;"> -2.5063 </td> <td style="text-align:right;font-weight: bold;"> 0.0123 </td> </tr> <tr> <td style="text-align:left;"> neighborhood_stone_br 
</td> <td style="text-align:right;"> 18587.7891 </td> <td style="text-align:right;"> 11357.0050 </td> <td style="text-align:right;"> 1.6367 </td> <td style="text-align:right;"> 0.1019 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_swisu </td> <td style="text-align:right;font-weight: bold;"> -36917.9793 </td> <td style="text-align:right;font-weight: bold;"> 11874.3430 </td> <td style="text-align:right;font-weight: bold;"> -3.1091 </td> <td style="text-align:right;font-weight: bold;"> 0.0019 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> neighborhood_timber </td> <td style="text-align:right;font-weight: bold;"> -29155.3435 </td> <td style="text-align:right;font-weight: bold;"> 10757.5958 </td> <td style="text-align:right;font-weight: bold;"> -2.7102 </td> <td style="text-align:right;font-weight: bold;"> 0.0068 </td> </tr> </tbody> </table>
Note that the reference Neighborhood is Veenker, so all neighborhood adjustments are relative to it.
The interpretable linear regression does moderately well. The residual standard error of the final model is $30,340. Note that this is the in-sample figure from the summary above; the 10-fold cross-validated RMSE is stored separately in lmFit$results. As a rough guide, about 95% of predictions should fall within two standard errors, or roughly $60,000, of the actual price. Given that the mean sale price for a house in Ames during the time period covered by our dataset is about $180,000, this corresponds to being within roughly 33% of the average price 95% of the time. This is not great, but it is a good starting point.
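The residual standard error printed by `summary()` and the cross-validated RMSE are related but different quantities: the first is computed on the data the model was fit to, the second on held-out folds. The sketch below illustrates the difference with a hand-rolled 10-fold cross-validation in base R. The `mtcars` data and the `mpg ~ wt + hp` formula are stand-ins chosen for illustration, not part of this analysis.

```r
set.seed(137)

# Built-in mtcars data stands in for the Ames training set (illustrative only).
dat <- mtcars
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = nrow(dat)))

# Hold out each fold in turn, fit on the rest, and score the held-out rows
fold_rmse <- sapply(seq_len(k), function(i) {
  fit <- lm(mpg ~ wt + hp, data = dat[fold_id != i, ])
  pred <- predict(fit, newdata = dat[fold_id == i, ])
  sqrt(mean((dat$mpg[fold_id == i] - pred)^2))
})

cv_rmse <- mean(fold_rmse)  # out-of-sample error, averaged over the folds
resid_se <- summary(lm(mpg ~ wt + hp, data = dat))$sigma  # in-sample residual standard error

c(cv_rmse = cv_rmse, resid_se = resid_se)
```

For the caret model in this section, the per-fold values are in `lmFit$resample` and the aggregated figure is in `lmFit$results`.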
The benefit of this type of model is its interpretability. To demonstrate this, we will interpret one numerical coefficient and one categorical coefficient.
Holding all other variables constant, a one hundred square foot increase in gross living area is associated with a $6,197 increase in sale price (p < 0.001 from linear regression). Based on our model, we can be 95% confident that the true increase in sale price is between $5,438 and $6,957 for a one hundred square foot increase in gross living area.
Holding all other variables constant, being located in the Old Town neighborhood is associated with a $36,657 decrease in sale price compared to a house in the Veenker neighborhood (p < 0.001 from linear regression). Based on the estimate and standard error in the coefficient table, we can be 95% confident that the true decrease in sale price is between roughly $15,900 and $57,400 for a house in the Old Town neighborhood compared to a house in the Veenker neighborhood.
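As a sketch of how such an interval follows from a row of the coefficient table, the snippet below recomputes the t value and an approximate 95% interval for the `neighborhood_nw_ames` row (estimate -46011.4606, standard error 10226.7919). It uses a normal critical value in place of the exact t quantile, which is an assumption that should matter little with well over a thousand rows.

```r
# Estimate and standard error copied from the neighborhood_nw_ames row
estimate <- -46011.4606
std_error <- 10226.7919

# Normal approximation to the t critical value (an assumption; the exact
# value depends on the model's residual degrees of freedom)
crit <- qnorm(0.975)
ci <- estimate + c(-1, 1) * crit * std_error
t_value <- estimate / std_error

round(c(lower = ci[1], upper = ci[2], t = t_value), 2)
```

With the fitted model available, calling `confint()` on the underlying `lm` object should give the exact t-based intervals directly.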
Because we are using a linear regression model, we must check the assumptions of the model:
plot(lmFit$finalModel)
The residuals for this model show some evidence of non-linearity and non-constant variance (heteroscedasticity). There is no evidence of non-normality, and there are no influential points that need to be addressed. We will address the issues in the next section when including transformations in our model.
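To illustrate why a log transformation can help with non-constant variance, here is a small sketch on simulated data (not the Ames data). It assumes multiplicative noise, so the spread of the response grows with its mean, and uses a crude check: the rank correlation between the absolute residuals and the fitted values. The helper `spread_trend` is hypothetical and defined only for this example.

```r
set.seed(1)

# Simulated data: noise is multiplicative, so the spread of y grows
# with its mean
n <- 500
x <- runif(n, 1, 10)
y <- exp(1 + 0.8 * log(x) + rnorm(n, sd = 0.3))

fit_raw <- lm(y ~ x)             # untransformed response
fit_log <- lm(log(y) ~ log(x))   # log-log model

# Crude heteroscedasticity check: rank correlation between the size of
# the residuals and the fitted values (near zero if variance is constant)
spread_trend <- function(fit) {
  cor(abs(resid(fit)), fitted(fit), method = "spearman")
}

c(raw = spread_trend(fit_raw), log = spread_trend(fit_log))
```

In this simulation the raw-scale fit shows a clear positive trend in residual spread, while the log-log fit shows essentially none. Real data will rarely be this clean.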
Objective 2: Predictive Model
To fit a linear model with more complexity, we included the transformations discussed in the EDA. This includes using the log of sale price, gross living area, and the other measured areas. Using log transformations makes the coefficients harder to interpret, but it should result in better predictions based on the relationships shown below.
EDA for transformed continuous variables:
# ames_non_dummy <- ames[sapply(ames, calculate_range) != 1]
train %>%
select(log_sale_price, log_gr_liv_area, log_lot_area, overall_qual_2, overall_cond,
year_built, year_remod_add, log_total_bsmt_sf, log_garage_area, bedroom_abv_gr, log_x1st_flr_sf) %>%
ggpairs(lower=list(continuous=lowerFn))
Warning messages (condensed; the notebook repeats each of these many times): `geom_smooth()` using formula = 'y ~ x'; Removed 1 rows containing missing values (`geom_text()`); Removed 37, 81, or 111 rows containing non-finite values (`stat_smooth()`); Removed 37 or 81 rows containing non-finite values (`stat_density()`); and, from simpleLoess() and predLoess(), "pseudoinverse used", "neighborhood radius", "reciprocal condition number", and "There are other near singularities as well".
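The "non-finite values" warnings are consistent with log-transforming a variable that contains zeros, since `log(0)` is `-Inf`. This is an assumption about the cause; the output does not say which columns are affected. A minimal sketch with made-up values:

```r
# Illustrative values only: square footage with one zero entry
area <- c(0, 400, 850, 1200)

log_area <- log(area)               # log(0) is -Inf
finite_rows <- is.finite(log_area)

# Two common ways to handle it: drop the non-finite rows,
# or use log(1 + x), which maps 0 to 0
log_area_finite <- log_area[finite_rows]
log1p_area <- log1p(area)

sum(!finite_rows)   # number of rows that would be dropped
```

Dropping rows changes the sample, while `log1p()` keeps every row but alters the scale slightly for small values, so the choice depends on how the zeros should be treated.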
EDA For interactions?: Running interaction model in the background. Will include if it finishes in time.
Complex LR with feature selection:
# Define variables to be used and create formula
predictor_vars <- c(
"log_gr_liv_area", "log_lot_area", "overall_qual_2", "year_built", "year_remod_add",
"log_total_bsmt_sf", "log_garage_area", "lot_config_fr2", "house_style1story", "exter_qual_gd",
"fireplace_qu_none", "sale_type_con"#, ". -sale_price" #Can include "." to make really complex
) %>% paste(collapse = " + ")
neighborhood_vars <- grep("neighborhood", colnames(train), value = TRUE) %>% paste(collapse = " + ")
terms <- (paste(predictor_vars, neighborhood_vars, sep = " + "))
formula <- as.formula(paste("log_sale_price ~", terms, "- neighborhood_veenker"))
# Complex LR with elastic-net (glmnet) regularization; alpha and lambda are tuned by cross-validation
# Check if the model object exists, train if it doesn't
if (file.exists("Models/lm_complex.rds")) {
# Load the model object from disk
lmComp <- readRDS("Models/lm_complex.rds")
} else {
# Set up a parallel backend with the number of cores you want to use
cores <- 8 # Change this to the number of cores you want to use
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)
set.seed(137)
lmComp <- train(formula,
data = train,
method = "glmnet",
trControl = trainControl(method = "cv", number = 5, allowParallel = TRUE)
# direction and penter are stepwise-selection options that glmnet does not use, so they are omitted
)
# Stop the parallel backend
stopCluster(cl)
# Save the model object to disk
saveRDS(lmComp, "Models/lm_complex.rds")
}
defaultSummary(data.frame(pred = predict(lmComp), obs = train$log_sale_price))
varImp(lmComp$finalModel) %>%
filter(Overall > 0) %>%
arrange(desc(Overall))
# Glmnet Regression model summary
lmComp
plot(lmComp)
opt.pen<-lmComp$finalModel$lambdaOpt #penalty term
coef(lmComp$finalModel,opt.pen)
# Output the predictions for the test set to a csv file
# Select only these variables from the testing dataset
# Get the names of the variables used in the model
var_names <- lmComp$finalModel$xNames
new_test <- test[, c(var_names, "neighborhood_veenker")]
id_col <- test$id
stepwise_pred <- predict(lmComp, newdata = as.matrix(new_test))
# Save predictions
data.frame(id = id_col, SalePrice = exp(stepwise_pred)) %>%
dplyr::select(id, SalePrice) %>%
write_csv("Predictions/complexlm_predictions.csv")
# stepwise_pred %>%
# data.frame() %>%
# rownames_to_column(var = "id") %>%
# mutate(SalePrice = exp(stepwise_pred)) %>%
# dplyr::select(id, SalePrice) %>%
# write_csv("Predictions/complexlm_predictions.csv")
| Metric | Value |
|---|---|
| RMSE | 0.141264697972528 |
| Rsquared | 0.875014600410696 |
| MAE | 0.10386416888613 |

| Variable | Overall |
|---|---|
| log_gr_liv_area | 0.404857597 |
| neighborhood_idotrr | 0.205928490 |
| neighborhood_gilbert | 0.156110544 |
| log_lot_area | 0.139474710 |
| neighborhood_edwards | 0.131285722 |
| neighborhood_br_dale | 0.110631767 |
| neighborhood_old_town | 0.105102551 |
| neighborhood_sawyer_w | 0.104842305 |
| neighborhood_crawfor | 0.096610610 |
| sale_type_con | 0.095257832 |
| neighborhood_mitchel | 0.092102541 |
| neighborhood_nw_ames | 0.090806304 |
| neighborhood_meadow_v | 0.090050461 |
| neighborhood_sawyer | 0.076540423 |
| neighborhood_collg_cr | 0.072704174 |
| neighborhood_timber | 0.069457450 |
| neighborhood_stone_br | 0.069004205 |
| neighborhood_blmngtn | 0.052052229 |
| fireplace_qu_none | 0.051729657 |
| lot_config_fr2 | 0.048520536 |
| neighborhood_swisu | 0.046572320 |
| neighborhood_n_ames | 0.043194588 |
| neighborhood_somerst | 0.042896560 |
| neighborhood_no_ridge | 0.032071842 |
| house_style1story | 0.029014767 |
| neighborhood_brk_side | 0.022692580 |
| exter_qual_gd | 0.018449066 |
| neighborhood_blueste | 0.011140420 |
| overall_qual_2 | 0.007509962 |
| neighborhood_clear_cr | 0.006013703 |
| neighborhood_nridg_ht | 0.005576222 |
| neighborhood_n_pk_vill | 0.004286030 |
| year_built | 0.002918634 |
| year_remod_add | 0.002635765 |
glmnet

1458 samples
37 predictor

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1166, 1166, 1167, 1166, 1167
Resampling results across tuning parameters:

| alpha | lambda | RMSE | Rsquared | MAE |
|---|---|---|---|---|
| 0.10 | 0.0006543964 | 0.1457942 | 0.8677977 | 0.1072713 |
| 0.10 | 0.0065439640 | 0.1459272 | 0.8675386 | 0.1071822 |
| 0.10 | 0.0654396404 | 0.1498018 | 0.8651849 | 0.1085737 |
| 0.55 | 0.0006543964 | 0.1458369 | 0.8676922 | 0.1072402 |
| 0.55 | 0.0065439640 | 0.1461345 | 0.8672977 | 0.1072563 |
| 0.55 | 0.0654396404 | 0.1665586 | 0.8475884 | 0.1189285 |
| 1.00 | 0.0006543964 | 0.1458812 | 0.8676002 | 0.1072906 |
| 1.00 | 0.0065439640 | 0.1471834 | 0.8656324 | 0.1081394 |
| 1.00 | 0.0654396404 | 0.1856431 | 0.8263513 | 0.1338341 |

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0.1 and lambda = 0.0006543964.
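For reference, the alpha and lambda values in the tuning grid refer to the elastic-net penalty. As documented for glmnet with a Gaussian response, the coefficients minimize the objective below, where $N$ is the number of rows, $y_i$ is the log sale price, and $x_i$ is the vector of predictors (the notation is ours, not the notebook's):

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
+ \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
```

Setting alpha = 1 gives the lasso and alpha = 0 gives ridge regression, so the selected alpha = 0.1 is close to ridge with a small amount of lasso-style shrinkage toward exact zeros.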
37 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -3.384558508
log_gr_liv_area 0.404857597
log_lot_area 0.139474710
overall_qual_2 0.007509962
year_built 0.002918634
year_remod_add 0.002635765
log_total_bsmt_sf .
log_garage_area .
lot_config_fr2 -0.048520536
house_style1story 0.029014767
exter_qual_gd -0.018449066
fireplace_qu_none -0.051729657
sale_type_con 0.095257832
neighborhood_blmngtn -0.052052229
neighborhood_blueste -0.011140420
neighborhood_br_dale -0.110631767
neighborhood_brk_side -0.022692580
neighborhood_clear_cr -0.006013703
neighborhood_collg_cr -0.072704174
neighborhood_crawfor 0.096610610
neighborhood_edwards -0.131285722
neighborhood_gilbert -0.156110544
neighborhood_idotrr -0.205928490
neighborhood_meadow_v -0.090050461
neighborhood_mitchel -0.092102541
neighborhood_n_ames -0.043194588
neighborhood_no_ridge 0.032071842
neighborhood_n_pk_vill 0.004286030
neighborhood_nridg_ht 0.005576222
neighborhood_nw_ames -0.090806304
neighborhood_old_town -0.105102551
neighborhood_sawyer -0.076540423
neighborhood_sawyer_w -0.104842305
neighborhood_somerst -0.042896560
neighborhood_stone_br 0.069004205
neighborhood_swisu -0.046572320
neighborhood_timber -0.069457450
The linear regression model with transformations does better than the interpretable model. Its RMSE on the log scale is 0.141 on the training data (0.146 from 5-fold cross validation), which translates to a multiplicative error of roughly 15 to 16% on the original scale. Interpreting this is difficult due to the complexity of the model, but it corresponds to an error of approximately $27,000 in the predicted sale price at the mean price of $180,000. This is an improvement over the roughly $30,000 error from the interpretable model.
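The conversion from a log-scale RMSE to an approximate dollar error can be sketched as follows, using the RMSE values printed in the output above (about 0.1413 on the training set and 0.1458 from cross validation) and the $180,000 mean price quoted earlier. Treating the RMSE as a single multiplicative factor at the mean price is a simplification, not an exact error bound.

```r
# Log-scale RMSE values reported above (training set and 5-fold CV)
rmse_log <- c(train = 0.1413, cv = 0.1458)

# An error of r on the log scale is a factor of exp(r) on the original scale
mult_error <- exp(rmse_log) - 1

# Rough dollar error at the mean sale price quoted earlier
mean_price <- 180000
dollar_error <- mult_error * mean_price

round(rbind(pct = 100 * mult_error, dollars = dollar_error), 1)
```

Because the error is multiplicative, the dollar figure scales with the price of the house: it would be smaller for inexpensive homes and larger for expensive ones.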
From the coefficients, we can see that the penalized regression kept most of the terms from the previous model. Only total basement area and garage area were shrunk to zero and excluded; every neighborhood indicator, including blmngtn and blueste, kept a non-zero coefficient.
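Although the log-scale coefficients are harder to read than dollar amounts, they can still be interpreted as percentage changes. The sketch below does this for two coefficients copied from the glmnet output above. Because the coefficients are shrunk by the penalty, these figures are descriptive only and carry no confidence intervals.

```r
# Coefficients copied from the glmnet output above
b_log_gr_liv_area <- 0.404857597   # log-log term
b_old_town <- -0.105102551         # 0/1 indicator

# Log-log term: a 10% increase in living area multiplies price by 1.10^b
pct_area <- 100 * (1.10^b_log_gr_liv_area - 1)

# Indicator: Old Town vs. the Veenker reference multiplies price by exp(b)
pct_old_town <- 100 * (exp(b_old_town) - 1)

round(c(area_10pct = pct_area, old_town = pct_old_town), 1)
```

Read this way, a 10% larger living area is associated with a price roughly 4% higher, and an Old Town location with a price roughly 10% lower than Veenker, holding the other predictors constant.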
Because we are still using a linear regression model, we must check the assumptions of the model:
# Plot the residuals of lmComp
# Choose a lambda value
lambda <- lmComp$bestTune$lambda
# Get predictions for this lambda
predictions <- predict(lmComp$finalModel, newx = as.matrix(train[, c(var_names)]), s = lambda)
# Calculate residuals
residuals <- train$log_sale_price - predictions
# Plot residuals
ggplot() +
geom_point(aes(x = predictions, y = residuals)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(x = "Fitted values", y = "Residuals", title = "Residuals vs Fitted values")
## Q-Q plot
qqnorm(residuals)
The residuals for this model don't show any evidence of non-linearity or non-constant variance (heteroscedasticity). There is no evidence of non-normality. There are two influential points that could be addressed if there was more time. Because of the transformations, the more complex linear model better meets the assumptions of linear regression than the interpretable model.
# Non-Parametric model
library(randomForest)
library(ggplot2)
set.seed(1234)
predictor_vars <- c(
"log_gr_liv_area", "log_lot_area", "overall_qual_2", "year_built", "year_remod_add",
"log_total_bsmt_sf", "log_garage_area", "lot_config_fr2", "house_style1story", "exter_qual_gd",
"fireplace_qu_none", "sale_type_con"#, ". -sale_price" #Can include "." to make really complex
) %>% paste(collapse = " * ")
neighborhood_vars <- grep("neighborhood", colnames(train), value = TRUE) %>% paste(collapse = " + ")
terms <- (paste(predictor_vars, neighborhood_vars, sep = " * "))
formula <- as.formula(paste("log_sale_price ~", terms, "- neighborhood_veenker"))
library(dplyr)
#removing rows with infinity in one of the values
df <- train[!is.infinite(rowSums(train)),]
rf.fit <- randomForest(formula, data = df,ntree=500)
#prediction on test case
df_test <- test[,!is.na(colSums(test))]
df_test <- df_test[!is.infinite(rowSums(df_test)),]
# The model predicts log_sale_price, so back-transform to dollars
df_test['sale_price'] <- exp(predict(rf.fit, newdata = df_test))
print(rf.fit)
plot(rf.fit)
## Visualize variable importance ----------------------------------------------
# Get variable importance from the model fit
ImpData <- as.data.frame(importance(rf.fit))
ImpData$Var.Names <- row.names(ImpData)
ggplot(ImpData, aes(x=Var.Names, y=`IncNodePurity`)) +
geom_segment( aes(x=Var.Names, xend=Var.Names, y=0, yend=`IncNodePurity`), color="skyblue") +
geom_point(aes(size = IncNodePurity), color="blue", alpha=0.6) +
theme_light() +
coord_flip() +
theme(
legend.position="bottom",
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
)
Package startup messages (condensed): randomForest 4.7-1.1; `outlier` is masked from package:psych, `combine` from package:dplyr, and `margin` from package:ggplot2.
Call:
randomForest(formula = formula, data = df, ntree = 500)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 12
Mean of squared residuals: 0.01842256
% Var explained: 87.03
The more complex linear regression model improved on the interpretable linear regression and the random forest further improved on that. The increase in predictive power, however, comes at the cost of interpretability of the model. The complex linear regression could be interpreted with some effort, but the random forest is closer to a black box that can only be used for prediction.
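To put the two predictive models on the same footing, the random forest's mean of squared residuals can be converted to an RMSE and compared with the glmnet cross-validated RMSE. Both values are copied from the outputs above and are on the log sale price scale. Note that out-of-bag error and 5-fold cross validation are different resampling schemes, so this is a rough comparison rather than a formal one.

```r
# Values copied from the outputs above (log sale price scale)
rf_oob_mse <- 0.01842256       # randomForest "Mean of squared residuals"
glmnet_cv_rmse <- 0.1457942    # best 5-fold CV RMSE from the tuning grid

# Convert the random forest MSE to an RMSE so the two are comparable
rf_oob_rmse <- sqrt(rf_oob_mse)

round(c(glmnet_cv = glmnet_cv_rmse, rf_oob = rf_oob_rmse), 4)
```

On these numbers the random forest's error is smaller by about 0.01 on the log scale, which is a modest gain.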
Conclusion
The first objective was to build a linear regression to explain some of the variation in Sale Price of homes in the Ames, IA dataset. We showed that it is possible to build a linear regression that is useful for interpreting the effects of variables of interest, but that this type of model was not the best choice for predictive accuracy. We recommend this type of model for a person who is interested in understanding the effects of variables on housing sale prices, but not necessarily for predicting the sale price of a home. For example, the model could be very useful for a developer who is deciding what features to include in a new development.
The second objective was to build a model that would be useful for predicting the sale price of homes in the Ames, IA dataset. We showed that a random forest model was the best choice for predictive accuracy, but that this model was not useful for interpreting the effects of variables of interest. A more complex linear regression model provides a compromise between the two ends of the spectrum. Which model to use depends on the needs of the user. We recommend the random forest model for a person who is interested in predicting the sale price of a home without needing to understand the effects of variables. For example, this could be very useful for a real estate agent who is trying to price a home for sale.
The scope of inference for this work is limited to house prices in Ames, IA during the time period this data was taken. Housing markets are very localized and can change drastically with time. Nonetheless, we believe the modeling approach to be generalizable provided the models are trained on data from the target population. Because it is an observational study, no causation can be implied. With more time or computing power, the authors believe there is room to fit even more complex models, such as a complex linear regression with interaction terms to capture effects that depend on other variables, or a random forest model with more trees and more variables.